We download AirBnB data from insideAirBnB.com; it was originally scraped from AirBnB.com. All AirBnB listings for Stockholm come in a single GZ file, i.e., a file compressed with the standard GNU zip (gzip) algorithm. We could download, save and extract the file manually, but vroom::vroom() and readr::read_csv() can read this kind of file directly. We prefer vroom() because it is faster: it downloads the *.gz file, decompresses it and returns a dataframe.
Let us have a look at the raw data that we downloaded and compute some first summary statistics of the dataset.
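As a rough illustration of reading a gzip-compressed file directly, the sketch below writes a small, made-up CSV to a temporary .gz file with base R and reads it back with readr::read_csv(). The toy data and the names toy, gz_path and toy_back are assumptions for illustration only; the real listings file is not used here.

```r
library(readr)

# Hypothetical toy data standing in for the real listings file
toy <- data.frame(id = c(1, 2), price = c("$1,052.00", "$949.00"))

# Write a gzip-compressed CSV to a temporary path using base R
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read_csv() decompresses the .gz file on the fly; no manual unzip step is needed
toy_back <- read_csv(gz_path, show_col_types = FALSE)
toy_back
```

The same call pattern should apply to vroom::vroom(), which also accepts a path or URL to a compressed file.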
#use head function to look at raw values
listings %>%
#hide these columns because their long text would make the table unreadable
select(-description, -neighborhood_overview, -host_about, -amenities) %>%
head() %>%
#use kable and kableextra to format the table in our HTML output
kable(caption = "First few rows of our original dataset",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
fixed_thead = T) %>%
#use row_spec to bold the column titles
row_spec(0, bold = T) %>%
#use scroll box to save space
scroll_box(width = "100%")

| id | listing_url | scrape_id | last_scraped | name | picture_url | host_id | host_url | host_name | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | latitude | longitude | property_type | room_type | accommodates | bathrooms | bathrooms_text | bedrooms | beds | price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | calendar_last_scraped | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20083 | https://www.airbnb.com/rooms/20083 | 2.021093e+13 | 2021-09-30 | Unique in Södermalm SOFO Stockholm | https://a0.muscache.com/pictures/102900/423071ef_original.jpg | 75962 | https://www.airbnb.com/users/show/75962 | Lovisa | 2010-01-31 | Stockholm, Stockholm, Sweden | a few days or more | 0% | 0% | FALSE | https://a0.muscache.com/im/users/75962/profile_pic/1359284327/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/75962/profile_pic/1359284327/original.jpg?aki_policy=profile_x_medium | NA | 1 | 1 | [‘email’, ‘phone’, ‘reviews’, ‘jumio’, ‘government_id’] | TRUE | TRUE | NA | Södermalms | NA | 59.30833 | 18.08615 | Entire rental unit | Entire home/apt | 3 | NA | 1 bath | 1 | 1 | $1,052.00 | 4 | 28 | 4 | 4 | 28 | 28 | 4 | 28 | NA | TRUE | 29 | 59 | 89 | 364 | 2021-09-30 | 12 | 0 | 0 | 2010-03-28 | 2014-09-01 | 5.00 | 5.00 | 4.90 | 5.00 | 4.90 | 5.00 | 4.90 | NA | FALSE | 1 | 1 | 0 | 0 | 0.09 |
| 75590 | https://www.airbnb.com/rooms/75590 | 2.021093e+13 | 2021-09-30 | Amazing nature location by a lake | https://a0.muscache.com/pictures/7430cc80-7a4f-4642-8eca-46cfa917dd08.jpg | 397766 | https://www.airbnb.com/users/show/397766 | Peter | 2011-02-18 | Stockholm, Stockholm, Sweden | a few days or more | 0% | 0% | FALSE | https://a0.muscache.com/im/users/397766/profile_pic/1372944928/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/397766/profile_pic/1372944928/original.jpg?aki_policy=profile_x_medium | NA | 1 | 1 | [‘email’, ‘phone’, ‘facebook’, ‘reviews’, ‘jumio’, ‘government_id’, ‘work_email’] | TRUE | TRUE | Nacka, Stockholm County, Sweden | Skarpnäcks | NA | 59.30117 | 18.12833 | Entire rental unit | Entire home/apt | 3 | NA | 1 bath | 2 | 1 | $949.00 | 30 | 100 | 30 | 30 | 100 | 100 | 30 | 100 | NA | TRUE | 28 | 30 | 30 | 87 | 2021-09-30 | 10 | 0 | 0 | 2016-07-08 | 2015-07-11 | 4.80 | 5.00 | 4.89 | 4.89 | 5.00 | 4.78 | 4.78 | NA | FALSE | 1 | 1 | 0 | 0 | 0.16 |
| 155220 | https://www.airbnb.com/rooms/155220 | 2.021093e+13 | 2021-09-30 | Stockholm, new spacoius villa | https://a0.muscache.com/pictures/982440/c2bc38b4_original.jpg | 746396 | https://www.airbnb.com/users/show/746396 | Madeleine | 2011-06-26 | Stockholm, Stockholm County, Kingdom of Sweden | within a day | 88% | 44% | FALSE | https://a0.muscache.com/im/pictures/user/579244fb-bb65-443b-9c0c-de8cb62990ac.jpg?aki_policy=profile_small | https://a0.muscache.com/im/pictures/user/579244fb-bb65-443b-9c0c-de8cb62990ac.jpg?aki_policy=profile_x_medium | NA | 2 | 2 | [‘email’, ‘phone’, ‘reviews’] | TRUE | TRUE | Stockholm, Stockholm County, Sweden | Skarpnäcks | NA | 59.24615 | 18.17870 | Entire residential home | Entire home/apt | 3 | NA | 1 bath | 2 | 3 | $1,200.00 | 3 | 730 | 3 | 3 | 730 | 730 | 3 | 730 | NA | TRUE | 6 | 36 | 66 | 66 | 2021-09-30 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | FALSE | 2 | 2 | 0 | 0 | NA |
| 164448 | https://www.airbnb.com/rooms/164448 | 2.021093e+13 | 2021-09-30 | Double room in central Stockholm with Wi-Fi | https://a0.muscache.com/pictures/1101571/13429928_original.jpg | 784312 | https://www.airbnb.com/users/show/784312 | Li | 2011-07-06 | Stockholm, Stockholm, Sweden | within an hour | 100% | 95% | TRUE | https://a0.muscache.com/im/users/784312/profile_pic/1314897997/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/784312/profile_pic/1314897997/original.jpg?aki_policy=profile_x_medium | Södermalm | 2 | 2 | [‘email’, ‘phone’, ‘reviews’, ‘jumio’, ‘government_id’] | TRUE | TRUE | NA | Södermalms | NA | 59.31389 | 18.06087 | Private room in rental unit | Private room | 2 | NA | 1 shared bath | 1 | 2 | $643.00 | 3 | 300 | 3 | 3 | 300 | 300 | 3 | 300 | NA | TRUE | 26 | 56 | 75 | 163 | 2021-09-30 | 319 | 6 | 2 | 2011-10-24 | 2019-07-11 | 4.85 | 4.86 | 4.84 | 4.96 | 4.97 | 4.82 | 4.76 | NA | TRUE | 2 | 0 | 2 | 0 | 2.64 |
| 170651 | https://www.airbnb.com/rooms/170651 | 2.021093e+13 | 2021-09-30 | Petit Charm Rooftop next to heaven | https://a0.muscache.com/pictures/77469446/c3be01c0_original.jpg | 814021 | https://www.airbnb.com/users/show/814021 | Marie | 2011-07-13 | Södermalm, Stockholm County, Sweden | within an hour | 100% | 50% | FALSE | https://a0.muscache.com/im/pictures/user/137c6966-c9a2-42a9-8b8e-dbb1e7dd0a24.jpg?aki_policy=profile_small | https://a0.muscache.com/im/pictures/user/137c6966-c9a2-42a9-8b8e-dbb1e7dd0a24.jpg?aki_policy=profile_x_medium | Södermalm | 1 | 1 | [‘email’, ‘phone’, ‘reviews’, ‘offline_government_id’, ‘government_id’] | TRUE | TRUE | NA | Södermalms | NA | 59.31702 | 18.02946 | Entire rental unit | Entire home/apt | 4 | NA | 1 bath | 1 | 2 | $704.00 | 4 | 30 | 4 | 4 | 30 | 30 | 4 | 30 | NA | TRUE | 4 | 6 | 6 | 200 | 2021-09-30 | 37 | 3 | 1 | 2016-07-23 | 2012-08-04 | 4.66 | 4.82 | 4.55 | 4.88 | 4.91 | 4.82 | 4.70 | NA | FALSE | 1 | 1 | 0 | 0 | 0.59 |
| 206221 | https://www.airbnb.com/rooms/206221 | 2.021093e+13 | 2021-09-30 | Doubleroom at Södermalm &trendySofo | https://a0.muscache.com/pictures/1792713/2c120093_original.jpg | 1022374 | https://www.airbnb.com/users/show/1022374 | Elisabeth | 2011-08-26 | Sweden | within a day | 100% | 0% | FALSE | https://a0.muscache.com/im/users/1022374/profile_pic/1344590239/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/1022374/profile_pic/1344590239/original.jpg?aki_policy=profile_x_medium | NA | 1 | 1 | [‘email’, ‘phone’, ‘reviews’] | TRUE | FALSE | NA | Södermalms | NA | 59.31074 | 18.08128 | Shared room in rental unit | Shared room | 2 | NA | 1 shared bath | 1 | 2 | $669.00 | 3 | 14 | 3 | 3 | 14 | 14 | 3 | 14 | NA | TRUE | 0 | 28 | 46 | 319 | 2021-09-30 | 79 | 0 | 0 | 2014-05-04 | 2018-07-04 | 4.92 | 4.83 | 4.83 | 4.94 | 4.90 | 4.94 | 4.83 | NA | FALSE | 1 | 0 | 0 | 1 | 0.88 |
skim(listings)

| Name | listings |
|---|---|
| Number of rows | 2933 |
| Number of columns | 74 |
| Column type frequency: | |
| character | 23 |
| Date | 5 |
| logical | 9 |
| numeric | 37 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 2933 | 0 |
| name | 2 | 1.00 | 1 | 112 | 0 | 2880 | 0 |
| description | 135 | 0.95 | 1 | 1000 | 0 | 2711 | 0 |
| neighborhood_overview | 1392 | 0.53 | 1 | 1000 | 0 | 1375 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 2875 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 2351 | 0 |
| host_name | 1 | 1.00 | 1 | 27 | 0 | 1130 | 0 |
| host_location | 14 | 1.00 | 2 | 68 | 0 | 153 | 0 |
| host_about | 1319 | 0.55 | 1 | 2915 | 0 | 1201 | 6 |
| host_response_time | 1 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 1 | 1.00 | 2 | 4 | 0 | 44 | 0 |
| host_acceptance_rate | 1 | 1.00 | 2 | 4 | 0 | 89 | 0 |
| host_thumbnail_url | 1 | 1.00 | 55 | 106 | 0 | 2325 | 0 |
| host_picture_url | 1 | 1.00 | 57 | 109 | 0 | 2325 | 0 |
| host_neighbourhood | 1202 | 0.59 | 6 | 21 | 0 | 23 | 0 |
| host_verifications | 0 | 1.00 | 4 | 170 | 0 | 151 | 0 |
| neighbourhood | 1392 | 0.53 | 14 | 47 | 0 | 77 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 6 | 22 | 0 | 14 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 43 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 20 | 0.99 | 6 | 17 | 0 | 26 | 0 |
| amenities | 0 | 1.00 | 2 | 1705 | 0 | 2806 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 871 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-30 | 2021-09-30 | 2021-09-30 | 1 |
| host_since | 1 | 1.00 | 2009-03-11 | 2021-09-22 | 2016-01-26 | 1643 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-30 | 2021-09-30 | 2021-09-30 | 1 |
| first_review | 519 | 0.82 | 2010-03-28 | 2021-09-29 | 2019-05-03 | 1309 |
| last_review | 519 | 0.82 | 2012-06-01 | 2021-09-29 | 2020-01-08 | 947 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 1 | 1 | 0.19 | FAL: 2383, TRU: 549 |
| host_has_profile_pic | 1 | 1 | 0.99 | TRU: 2906, FAL: 26 |
| host_identity_verified | 1 | 1 | 0.76 | TRU: 2242, FAL: 690 |
| neighbourhood_group_cleansed | 2933 | 0 | NaN | : |
| bathrooms | 2933 | 0 | NaN | : |
| calendar_updated | 2933 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.97 | TRU: 2842, FAL: 91 |
| license | 2933 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.26 | FAL: 2177, TRU: 756 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.910908e+07 | 16292581.11 | 2.008300e+04 | 1.501049e+07 | 3.130195e+07 | 4.368965e+07 | 5.250907e+07 | ▅▅▅▆▇ |
| scrape_id | 0 | 1.00 | 2.021093e+13 | 0.00 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.081495e+08 | 118347971.67 | 9.842000e+03 | 1.668864e+07 | 5.471258e+07 | 1.781562e+08 | 4.241880e+08 | ▇▂▁▁▁ |
| host_listings_count | 1 | 1.00 | 5.440000e+00 | 50.89 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.063000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 1 | 1.00 | 5.440000e+00 | 50.89 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.063000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.932000e+01 | 0.03 | 5.923000e+01 | 5.930000e+01 | 5.932000e+01 | 5.934000e+01 | 5.942000e+01 | ▁▃▇▂▁ |
| longitude | 0 | 1.00 | 1.803000e+01 | 0.06 | 1.780000e+01 | 1.801000e+01 | 1.805000e+01 | 1.808000e+01 | 1.818000e+01 | ▁▁▂▇▁ |
| accommodates | 0 | 1.00 | 3.200000e+00 | 1.89 | 0.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▅▁▁▁ |
| bedrooms | 292 | 0.90 | 1.640000e+00 | 1.06 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.500000e+01 | ▇▁▁▁▁ |
| beds | 37 | 0.99 | 2.050000e+00 | 1.69 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 7.880000e+00 | 30.23 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 5.000000e+02 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 5.758600e+02 | 552.48 | 1.000000e+00 | 3.000000e+01 | 3.650000e+02 | 1.125000e+03 | 9.999000e+03 | ▇▁▁▁▁ |
| minimum_minimum_nights | 4 | 1.00 | 7.680000e+00 | 29.79 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 5.000000e+00 | 5.000000e+02 | ▇▁▁▁▁ |
| maximum_minimum_nights | 4 | 1.00 | 8.110000e+00 | 30.34 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 5.000000e+02 | ▇▁▁▁▁ |
| minimum_maximum_nights | 4 | 1.00 | 7.152700e+02 | 544.00 | 1.000000e+00 | 3.100000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | ▇▁▁▁▁ |
| maximum_maximum_nights | 4 | 1.00 | 7.309900e+02 | 539.62 | 1.000000e+00 | 4.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 4 | 1.00 | 7.940000e+00 | 30.08 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 5.000000e+02 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 4 | 1.00 | 7.294500e+02 | 539.00 | 1.000000e+00 | 4.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 8.020000e+00 | 11.08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.600000e+01 | 3.000000e+01 | ▇▁▁▁▂ |
| availability_60 | 0 | 1.00 | 1.940000e+01 | 23.06 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 4.000000e+01 | 6.000000e+01 | ▇▁▂▁▃ |
| availability_90 | 0 | 1.00 | 3.286000e+01 | 35.13 | 0.000000e+00 | 0.000000e+00 | 1.700000e+01 | 6.700000e+01 | 9.000000e+01 | ▇▂▁▂▃ |
| availability_365 | 0 | 1.00 | 1.375800e+02 | 134.82 | 0.000000e+00 | 0.000000e+00 | 8.800000e+01 | 2.720000e+02 | 3.650000e+02 | ▇▂▂▂▃ |
| number_of_reviews | 0 | 1.00 | 2.462000e+01 | 49.15 | 0.000000e+00 | 1.000000e+00 | 7.000000e+00 | 2.500000e+01 | 5.930000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 3.850000e+00 | 9.16 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.480000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 4.700000e-01 | 1.26 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 519 | 0.82 | 4.700000e+00 | 0.68 | 0.000000e+00 | 4.670000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 554 | 0.81 | 4.800000e+00 | 0.36 | 1.000000e+00 | 4.760000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 554 | 0.81 | 4.720000e+00 | 0.43 | 1.000000e+00 | 4.630000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 554 | 0.81 | 4.870000e+00 | 0.28 | 1.000000e+00 | 4.840000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 554 | 0.81 | 4.860000e+00 | 0.33 | 1.000000e+00 | 4.840000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 554 | 0.81 | 4.800000e+00 | 0.30 | 1.000000e+00 | 4.720000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 554 | 0.81 | 4.700000e+00 | 0.37 | 1.000000e+00 | 4.600000e+00 | 4.770000e+00 | 4.930000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 3.870000e+00 | 9.80 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 6.200000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 3.140000e+00 | 9.72 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 6.200000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 5.000000e-01 | 1.20 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 9.000000e+00 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 4.000000e-02 | 0.37 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 519 | 0.82 | 1.110000e+00 | 1.62 | 1.000000e-02 | 2.000000e-01 | 5.200000e-01 | 1.300000e+00 | 2.149000e+01 | ▇▁▁▁▁ |
#use skim function to summarize all variables we have and look at NA values
#skim(listings) %>%
# kable(caption = "Brief summary on all variables in our original dataset",
# align = "l") %>%
# kable_classic(c("striped", "hover", "condensed"),
# html_font = "Arial",
# fixed_thead = T) %>%
# row_spec(0, bold = T) %>%
# scroll_box(width = "100%", height = "400px")

From the above, we can see that our dataframe contains 2,933 observations and 74 variables. Of these variables, 23 are of type character, 5 of type Date, 9 of type logical (TRUE or FALSE) and the remaining 37 are numeric. The data dictionary explains what all these variables mean.
Since it does not make sense to include all variables as explanatory variables in our model to predict the price of a 4-night stay in Stockholm, we will now continue to explore the dataframe and identify those variables that are of particular interest.
After looking at the dataframe in more detail, we conclude that we can disregard many variables, as they provide no useful information about the price. These include, among others, scrape_id, last_scraped, picture_url and host_id. We remove host_id because we cannot be sure that a smaller host_id means the host has been active on AirBnB for longer, and because the variable host_since already gives us this information. Removing these variables from the dataframe will also make our later calculations faster.
#check whether host_total_listings_count and host_listings_count are the same
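#compare the whole columns; if() needs a single TRUE/FALSE, not one value per row
if (isTRUE(all.equal(listings$host_total_listings_count, listings$host_listings_count))) {
  print("Is identical")
}
[1] "Is identical"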
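A whole-column comparison can be sketched on toy data as follows. The vectors a, b and c_diff are made up for illustration; the point is only that `==` returns one value per row (including NA where a value is missing), whereas if() expects a single TRUE or FALSE, which isTRUE(all.equal()) provides.

```r
# Hypothetical toy columns, including a shared missing value
a <- c(1, 2, NA, 5)
b <- c(1, 2, NA, 5)

# Element-wise comparison: one result per row, NA where a value is missing
a == b

# all.equal() compares the vectors as a whole;
# isTRUE() reduces the result to a single TRUE or FALSE
same <- isTRUE(all.equal(a, b))

# A vector that differs in its last element, for contrast
c_diff <- c(1, 2, NA, 6)
different <- isTRUE(all.equal(a, c_diff))

same
different
```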
listings_clean <- listings %>%
#remove irrelevant variables
select(-c(description, neighborhood_overview, scrape_id, last_scraped,
picture_url, host_url, host_id, host_name, host_about,
host_thumbnail_url, host_picture_url, host_verifications,
host_location, host_total_listings_count, host_neighbourhood,
host_identity_verified,
#remove variables other than minimum and maximum nights (as these are the ones applicable to the listing)
minimum_minimum_nights, minimum_maximum_nights, maximum_minimum_nights, maximum_maximum_nights,
minimum_nights_avg_ntm, maximum_nights_avg_ntm,
calendar_updated, calendar_last_scraped,
#remove license as only NAs
license,
#remove date of first review as only latest relevant
first_review,
#remove detailed counts of listings
calculated_host_listings_count_entire_homes,
calculated_host_listings_count_private_rooms,
calculated_host_listings_count_shared_rooms
))
#use skim function to summarize all remaining variables and look at NA values
skim(listings_clean) %>%
kable(caption = "Brief summary on all variables in our cleaned dataset",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
fixed_thead = T) %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%", height = "400px")

| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | Date.min | Date.max | Date.median | Date.n_unique | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | listing_url | 0 | 1.0000000 | 34 | 37 | 0 | 2933 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | name | 2 | 0.9993181 | 1 | 112 | 0 | 2880 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_time | 1 | 0.9996591 | 3 | 18 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_rate | 1 | 0.9996591 | 2 | 4 | 0 | 44 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_acceptance_rate | 1 | 0.9996591 | 2 | 4 | 0 | 89 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood | 1392 | 0.5254006 | 14 | 47 | 0 | 77 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.0000000 | 6 | 22 | 0 | 14 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | property_type | 0 | 1.0000000 | 4 | 35 | 0 | 43 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | room_type | 0 | 1.0000000 | 10 | 15 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | bathrooms_text | 20 | 0.9931810 | 6 | 17 | 0 | 26 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | amenities | 0 | 1.0000000 | 2 | 1705 | 0 | 2806 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | price | 0 | 1.0000000 | 5 | 10 | 0 | 871 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | host_since | 1 | 0.9996591 | NA | NA | NA | NA | NA | 2009-03-11 | 2021-09-22 | 2016-01-26 | 1643 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_review | 519 | 0.8230481 | NA | NA | NA | NA | NA | 2012-06-01 | 2021-09-29 | 2020-01-08 | 947 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 1 | 0.9996591 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.1872442 | FAL: 2383, TRU: 549 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_has_profile_pic | 1 | 0.9996591 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9911323 | TRU: 2906, FAL: 26 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | neighbourhood_group_cleansed | 2933 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | bathrooms | 2933 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | has_availability | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9689737 | TRU: 2842, FAL: 91 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.2577566 | FAL: 2177, TRU: 756 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.910908e+07 | 1.629258e+07 | 20083.00000 | 1.501049e+07 | 3.130195e+07 | 4.368965e+07 | 5.250907e+07 | ▅▅▅▆▇ |
| numeric | host_listings_count | 1 | 0.9996591 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.437244e+00 | 5.089220e+01 | 0.00000 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.063000e+03 | ▇▁▁▁▁ |
| numeric | latitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.932090e+01 | 3.100260e-02 | 59.23411 | 5.930386e+01 | 5.931884e+01 | 5.933834e+01 | 5.941933e+01 | ▁▃▇▂▁ |
| numeric | longitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.803201e+01 | 6.435650e-02 | 17.79731 | 1.800573e+01 | 1.805074e+01 | 1.807608e+01 | 1.817870e+01 | ▁▁▂▇▁ |
| numeric | accommodates | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.200477e+00 | 1.894211e+00 | 0.00000 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▅▁▁▁ |
| numeric | bedrooms | 292 | 0.9004432 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.637259e+00 | 1.055019e+00 | 1.00000 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.500000e+01 | ▇▁▁▁▁ |
| numeric | beds | 37 | 0.9873849 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.049033e+00 | 1.686176e+00 | 0.00000 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| numeric | minimum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.875213e+00 | 3.022840e+01 | 1.00000 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 5.000000e+02 | ▇▁▁▁▁ |
| numeric | maximum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.758616e+02 | 5.524845e+02 | 1.00000 | 3.000000e+01 | 3.650000e+02 | 1.125000e+03 | 9.999000e+03 | ▇▁▁▁▁ |
| numeric | availability_30 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.022503e+00 | 1.107925e+01 | 0.00000 | 0.000000e+00 | 0.000000e+00 | 1.600000e+01 | 3.000000e+01 | ▇▁▁▁▂ |
| numeric | availability_60 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.940027e+01 | 2.306382e+01 | 0.00000 | 0.000000e+00 | 4.000000e+00 | 4.000000e+01 | 6.000000e+01 | ▇▁▂▁▃ |
| numeric | availability_90 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.286498e+01 | 3.512540e+01 | 0.00000 | 0.000000e+00 | 1.700000e+01 | 6.700000e+01 | 9.000000e+01 | ▇▂▁▂▃ |
| numeric | availability_365 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.375813e+02 | 1.348176e+02 | 0.00000 | 0.000000e+00 | 8.800000e+01 | 2.720000e+02 | 3.650000e+02 | ▇▂▂▂▃ |
| numeric | number_of_reviews | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.462462e+01 | 4.914633e+01 | 0.00000 | 1.000000e+00 | 7.000000e+00 | 2.500000e+01 | 5.930000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_ltm | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.846573e+00 | 9.155973e+00 | 0.00000 | 0.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.480000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_l30d | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.742584e-01 | 1.255996e+00 | 0.00000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+01 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 519 | 0.8230481 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.695261e+00 | 6.834498e-01 | 0.00000 | 4.670000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_accuracy | 554 | 0.8111149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.801740e+00 | 3.616418e-01 | 1.00000 | 4.760000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_cleanliness | 554 | 0.8111149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.716545e+00 | 4.338537e-01 | 1.00000 | 4.630000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_checkin | 554 | 0.8111149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.871240e+00 | 2.828460e-01 | 1.00000 | 4.840000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_communication | 554 | 0.8111149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.855641e+00 | 3.300165e-01 | 1.00000 | 4.840000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_location | 554 | 0.8111149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.800084e+00 | 3.032844e-01 | 1.00000 | 4.725000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_value | 554 | 0.8111149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.701833e+00 | 3.662254e-01 | 1.00000 | 4.600000e+00 | 4.770000e+00 | 4.930000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | calculated_host_listings_count | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.872826e+00 | 9.802336e+00 | 1.00000 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 6.200000e+01 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 519 | 0.8230481 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.114917e+00 | 1.621651e+00 | 0.01000 | 2.000000e-01 | 5.250000e-01 | 1.300000e+00 | 2.149000e+01 | ▇▁▁▁▁ |
Now that we have removed some of the unnecessary variables and reduced the number of columns from 74 to 45, we want to inspect the remaining variables in more detail. This will not only help us understand the data better but also let us clean it further for our later regression analysis.
Let us first have a look at the price variable, which we want to predict later. Please note that the price is given in SEK.
typeof(listings_clean$price)
[1] "character"
The price variable is currently of type character. Since price is a quantitative variable, we need to make sure it is stored as numeric data in the dataframe. To do so, we use readr::parse_number(), which drops any non-numeric characters before or after the first number.
#convert character variable to numeric
listings_clean <- listings_clean %>%
mutate(price = parse_number(price))
typeof(listings_clean$price)
[1] "double"
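To make the conversion concrete, here is a minimal sketch on made-up price strings written in the same style as the raw price column. The vector raw_prices is an assumption for illustration and is not taken from the dataset.

```r
library(readr)

# Made-up price strings in the same format as the raw price column
raw_prices <- c("$1,052.00", "$949.00", "$25,000.00")

# parse_number() drops the currency symbol and the grouping commas
parsed <- parse_number(raw_prices)

parsed
typeof(parsed)
```

Because the grouping commas are handled by parse_number(), no manual string replacement is needed before the conversion.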
We will look at some statistics about price.
#summary statistics of price
favstats(~price, data = listings_clean) %>%
kable(caption = "Statistics on listing price in Stockholm",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 600 | 900 | 1429 | 25000 | 1164.388 | 1147.716 | 2933 | 0 |
#distribution of price
density <- ggplot(listings_clean, aes(x=price)) +
geom_density(fill="pink")+
labs(x = NULL,
y = "Density") +
theme_bw()+
NULL
boxpl <- ggplot(listings_clean, aes(x = price)) +
geom_boxplot(fill = "pink")+
labs(x = NULL, y = NULL) +
theme_bw()+
NULL
grid.arrange(density, boxpl, ncol=2, nrow=1,
top=textGrob("Distribution of AirBnB prices per night in Stockholm"),
bottom = textGrob("Price per night", gp = gpar(cex = 0.9)))The distribution of price per night in SEK is unimodel and heavily positively skewed, demonstrated by the mean (1,164SEK) being greater than the median (900SEK), the distribution is thus non-normally distributed. The median price per night is 900SEK, or roughly £75. The large dispersion of prices is almost certainly a consequence of the nondescript nature in which the price data was collected, with no filters applied the maximum price per night in Stockholm is 25,000SEK (c.£2,100) likely for a large and extremely premium property. So far, we can’t conclude on the property type or location for this property as we don’t know whether its price is a function of location or just size and amenities - i.e could be a large detached property in green spaces outside central Stockholm or large modern loft in the middle of the city.
It is interesting to note that a price of 0SEK per night is observed, perhaps this is an input error or it could be a free sofa-surfing arrangement offered by the host for travelers or those in need. The concentration or high kurtosis of the distribution is somewhat surprising. The middle 50% of the data is between 600 and 1,429 SEK, which converts to c.£51 and £121, respectively. A median price of £75 a night is relatively cheap for a developed nations capital city, an inference reinforced when considering there are few listings beyond 2500SEK a night. This could perhaps be a reflection of the socio-economic backgrounds of those listing properties on AirBnB and also perhaps a reflection of Swedish urban housing. Having looked at the various neighbourhoods, small apartments in tower blocks appear to dominate the housing on offer to local residents. This is unlike London where there is a greater mix of semi-detached housing, converted flats and purpose built complexes.
The boxplot above clearly shows that we have two very strong outliers with a price of 20,000SEK or above, far beyond what a tourist would normally pay for one night. Given that these outliers may distort our future calculations, we will remove them from the dataset.
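As a point of comparison for the fixed 20,000SEK cutoff, the sketch below works out the conventional 1.5 × IQR fences from the quartiles quoted above (600 and 1,429 SEK). This is the rule `geom_boxplot()` uses by default to decide which points are drawn individually; it is shown only as an illustration, not as the filter applied in this analysis, and the variable names are our own.

```r
# Quartiles quoted in the text (SEK per night)
q1 <- 600
q3 <- 1429

# Interquartile range and the usual 1.5 * IQR fences
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

cat("IQR:", iqr, "\n")
cat("Upper fence:", upper_fence, "SEK\n")
cat("Lower fence:", lower_fence, "SEK\n")
```

Under these assumptions, any listing above roughly 2,700SEK would already be drawn as an individual point in the boxplot, so the 20,000SEK threshold removes only the most extreme cases and keeps the rest of the right tail.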
Last but not least, we need to address the fact that we have AirBnBs with zero price. All of them are hotel rooms, so perhaps the hotel has already filled them or they forgot to take the listing out of AirBnB’s platform. In either case, these should also be disregarded in the final dataset and model, as they only distort the data.
#remove price outliers
listings_clean <- listings_clean %>%
filter(price < 20000 & price > 0)
AirBnB is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes. Therefore, let us look at the variable minimum_nights, which indicates the minimum number of nights a guest must stay at the listing.
hist1 <- ggplot(data = listings_clean, aes(x = minimum_nights)) +
geom_histogram(fill = "navy") +
labs(x = NULL, y = "Number of listings") +
theme_bw()+
NULL
hist2 <- ggplot(data = listings_clean, aes(x = minimum_nights)) +
geom_histogram(binwidth = 1, fill="navy") +
coord_cartesian(xlim = c(0,8)) +
#scale_x_discrete(breaks=seq(-2,8,by=1)) +
labs(x = NULL, y = NULL) +
theme_bw()+
NULL
library(cowplot)
#plot_grid(hist1, hist2)
grid.arrange(hist1, hist2, ncol=2, nrow=1,
top=textGrob("Number of listings according to minimum nights required"),
bottom = textGrob("Number of minimum nights required",
gp = gpar(cex = 0.9)))
The right plot from the above output indicates that 2 is the most common number of minimum nights required by hosts, followed by a minimum stay of 1 night. This is confirmed by the following table, which counts the listings for each minimum stay:
head(listings_clean %>%
count(minimum_nights) %>%
mutate(percent = round(n/sum(n) * 100,2)) %>%
arrange(desc(n)), n=10) %>%
kable(caption = "Summary of how many nights is required as minimum stay",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T)
| minimum_nights | n | percent |
|---|---|---|
| 2 | 748 | 25.58 |
| 1 | 650 | 22.23 |
| 3 | 476 | 16.28 |
| 4 | 279 | 9.54 |
| 5 | 243 | 8.31 |
| 7 | 168 | 5.75 |
| 6 | 76 | 2.60 |
| 30 | 58 | 1.98 |
| 14 | 43 | 1.47 |
| 10 | 21 | 0.72 |
We also see a bump at 7 days, meaning that certain hosts require their guests to stay 7 days. This might be due to the fact that they are on vacation during that time themselves and want to rent their apartment/house for the full time they are away.
From the left plot of the above output we can see that there are some listings that expect people to stay for at least 30 days, 100 days, one year (365 days) or even 500 days. These listings are most probably intended to be long-term rentals, and may be a way for the landlord to circumvent tax by renting them as AirBnBs instead of as classic apartments. They are, however, not intended for tourists or short-term visitors and are thus not useful in our analysis, as we want to predict the price for a 4-night stay. We therefore remove all listings that require a minimum stay of more than 4 nights.
#remove listings with more than 4 nights required
listings_clean <- listings_clean %>%
filter(minimum_nights <= 4)
Next, we look at the variable property_type. We can use the count function to determine how many categories there are and how frequent each one is. From the output below we can see that the 4 most common property types are Entire rental unit, Private room in rental unit, Entire condominium (condo) and Entire residential home. Together they make up more than 80% of the listings.
#count observations per property type category
head(listings_clean %>%
count(property_type) %>%
#calculate percentage
mutate(percent = round(n/sum(n)*100,2)) %>%
arrange(desc(percent)), n = 15) %>%
kable(caption = "Summary of property types",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F,
fixed_thead = T) %>%
row_spec(0, bold = T) %>%
scroll_box(height = "250px")
| property_type | n | percent |
|---|---|---|
| Entire rental unit | 1149 | 53.37 |
| Private room in rental unit | 382 | 17.74 |
| Entire condominium (condo) | 114 | 5.29 |
| Entire residential home | 110 | 5.11 |
| Entire townhouse | 44 | 2.04 |
| Entire loft | 35 | 1.63 |
| Room in hotel | 34 | 1.58 |
| Private room in bed and breakfast | 25 | 1.16 |
| Shared room in rental unit | 24 | 1.11 |
| Private room in condominium (condo) | 21 | 0.98 |
| Private room in residential home | 18 | 0.84 |
| Private room in townhouse | 17 | 0.79 |
| Private room in villa | 16 | 0.74 |
| Entire villa | 15 | 0.70 |
| Room in aparthotel | 15 | 0.70 |
Since the vast majority of the observations in the data belong to one of the top four property types, we would like to create a simplified version of the property_type variable that has 5 categories: the top four categories and Other.
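The same collapsing step can also be written without hard-coding the category names. The sketch below is a minimal base R illustration on a small made-up vector (the toy data and the names `counts`, `top_types` and `simplified` are hypothetical, not part of the original analysis): it finds the four most frequent types and maps everything else to Other.

```r
# Toy vector standing in for listings_clean$property_type (hypothetical data)
property_type <- c(rep("Entire rental unit", 6),
                   rep("Private room in rental unit", 4),
                   rep("Entire condominium (condo)", 3),
                   rep("Entire residential home", 2),
                   "Boat", "Tiny house")

# Frequency of each type, most common first
counts <- sort(table(property_type), decreasing = TRUE)

# Names of the four most frequent types
top_types <- names(counts)[1:4]

# Keep the top four as they are, collapse the rest into "Other"
simplified <- ifelse(property_type %in% top_types, property_type, "Other")
simplified <- factor(simplified, levels = c(top_types, "Other"))

table(simplified)
```

Deriving the top categories from the data means the code keeps working if the dataset is refreshed, at the cost of the category order depending on the counts rather than on a fixed, hand-picked order.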
#create vector to order simplified property types later
ordered_p_types = c("Entire rental unit","Private room in rental unit",
"Entire residential home","Entire condominium (condo)",
"Other")
#simplify property_type variable to only have 5 categories
listings_clean <- listings_clean %>%
mutate(prop_type_simplified = factor(case_when(
property_type %in% c("Entire rental unit","Private room in rental unit",
"Entire residential home","Entire condominium (condo)") ~ property_type,
TRUE ~ "Other"),
levels = ordered_p_types,
labels = ordered_p_types))
#check whether new variable was correctly created
listings_clean %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n)) %>%
kable(caption = "Summary of simplified property types",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F,
fixed_thead = T) %>%
row_spec(0, bold = T) %>%
scroll_box(height = "250px")
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 1149 |
| Private room in rental unit | Private room in rental unit | 382 |
| Entire condominium (condo) | Entire condominium (condo) | 114 |
| Entire residential home | Entire residential home | 110 |
| Entire townhouse | Other | 44 |
| Entire loft | Other | 35 |
| Room in hotel | Other | 34 |
| Private room in bed and breakfast | Other | 25 |
| Shared room in rental unit | Other | 24 |
| Private room in condominium (condo) | Other | 21 |
| Private room in residential home | Other | 18 |
| Private room in townhouse | Other | 17 |
| Private room in villa | Other | 16 |
| Entire villa | Other | 15 |
| Room in aparthotel | Other | 15 |
| Room in serviced apartment | Other | 15 |
| Entire guesthouse | Other | 13 |
| Entire serviced apartment | Other | 12 |
| Shared room in hostel | Other | 12 |
| Room in hostel | Other | 11 |
| Entire cabin | Other | 8 |
| Private room in hostel | Other | 8 |
| Private room in loft | Other | 7 |
| Tiny house | Other | 6 |
| Room in boutique hotel | Other | 5 |
| Boat | Other | 4 |
| Entire guest suite | Other | 4 |
| Private room in guesthouse | Other | 4 |
| Camper/RV | Other | 3 |
| Private room in casa particular | Other | 3 |
| Private room in serviced apartment | Other | 3 |
| Entire cottage | Other | 2 |
| Private room in boat | Other | 2 |
| Private room in guest suite | Other | 2 |
| Shared room in bed and breakfast | Other | 2 |
| Shared room in condominium (condo) | Other | 2 |
| Casa particular | Other | 1 |
| Entire in-law | Other | 1 |
| Farm stay | Other | 1 |
| Shared room in cabin | Other | 1 |
| Shared room in residential home | Other | 1 |
| Shared room in tiny house | Other | 1 |
Let us now look at the price distribution for these 5 categories and see whether we can find a difference.
#summary statistics of price categorised by simplified property type
favstats(price~prop_type_simplified, data = listings_clean) %>%
kable(caption = "Statistics on simplified property types",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| prop_type_simplified | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Entire rental unit | 99 | 755.00 | 1000.0 | 1500.00 | 12000 | 1242.2124 | 885.3052 | 1149 | 0 |
| Private room in rental unit | 150 | 350.75 | 490.0 | 679.50 | 6000 | 595.5471 | 531.8034 | 382 | 0 |
| Entire residential home | 250 | 1268.50 | 1745.0 | 2185.75 | 8000 | 1868.7727 | 962.6486 | 110 | 0 |
| Entire condominium (condo) | 496 | 850.00 | 1003.0 | 1499.75 | 4500 | 1251.7281 | 714.1561 | 114 | 0 |
| Other | 120 | 453.25 | 713.5 | 1218.00 | 12015 | 1053.9673 | 1108.8448 | 398 | 0 |
#plot distribution of price for each simplified property type
ggplot(data=listings_clean,
aes(x = prop_type_simplified, y = price,
color = prop_type_simplified)) +
geom_boxplot() +
theme_bw() +
labs(title = "Distribution of price among property types",
x = "Property type",
y = "Price") +
theme(legend.position = "none",
axis.text=element_text(size=7)) +
NULL
The boxplot of the property types clearly shows that the type with the lowest median is Private room in rental unit. The type with the highest median is Entire residential home. The highest and lowest medians make intuitive sense, as the lowest is for a single room whilst the highest is for an entire home.
An interesting observation, though, is the number of outliers in each of the property types. This probably has mostly to do with the subjectivity of classifying property types. As the data clearly shows, the type with the most outliers is Entire rental unit. This label can cover anything from a studio flat to an entire house, and the latter would cost significantly more than a single studio. The Other category also has a significant number of outliers, which is likely due to the cleaning done in the previous section that grouped many different property types together. The raw data illustrates this: boats, campervans and cabins are included in Other along with entire townhouses and lofts. The effect of subjectivity on ranges and outliers is also apparent for the condominium property type, which has far fewer outliers and a tight range because the category leaves hosts less room for interpretation when listing their properties.
It is further interesting to note the concentration of the Private room in rental unit distribution. Although it is certainly non-normal, the concentration of prices around the median is striking and suggests that the market for private rooms is close to equilibrium, with a tight range of prevailing market prices dominating listings. Entire residential homes, on the other hand, have a more dispersed distribution that resembles a normal distribution far more than the other categories do. This suggests a wider variation in the prevailing market prices but is perhaps also a reflection, again, of the subjectivity of categorization, as residential homes can vary widely in size, location and amenities.
A further observation is that entire rental units make up over half of the properties listed. This could have a significant impact on our predictive model.
Our dataset does not only contain information about the property type of the listing but also about the room type. We will inspect this variable in a next step.
#count observations per room type category
head(listings_clean %>%
count(room_type) %>%
#calculate percentage
mutate(percent = round(n/sum(n)*100,2)) %>%
arrange(desc(percent)), n = 15) %>%
kable(caption = "Summary of room types",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T)
| room_type | n | percent |
|---|---|---|
| Entire home/apt | 1537 | 71.39 |
| Private room | 541 | 25.13 |
| Shared room | 43 | 2.00 |
| Hotel room | 32 | 1.49 |
#summary statistics of price based on room categories
favstats(price ~ room_type, data = listings_clean) %>%
kable(caption = "Statistics on room types",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| room_type | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 99 | 786.0 | 1024 | 1509.0 | 12000 | 1308.5062 | 933.0245 | 1537 | 0 |
| Hotel room | 234 | 583.5 | 650 | 824.0 | 2539 | 738.0938 | 441.7582 | 32 | 0 |
| Private room | 150 | 379.0 | 500 | 745.0 | 12015 | 666.2421 | 729.8080 | 541 | 0 |
| Shared room | 120 | 169.0 | 364 | 661.5 | 4500 | 635.1628 | 849.5422 | 43 | 0 |
#visualise distribution of price among room categories
ggplot(listings_clean,
aes(x = room_type, y = price,
color = room_type)) +
geom_boxplot() +
#facet_wrap(~room_type)+
labs(title = "Distribution of price among room types",
x = "Room type",
y = "Price per night") +
theme_bw() +
theme(legend.position = "none") +
NULL
From the table above as well as the boxplots, we can see that entire homes and apartments have the highest average price and differ strongly in price from the other room types. The price distributions of all room types are right skewed: in every category the mean exceeds the median, although the gap is smallest for hotel rooms. We will keep this difference in mind later when we build our model. It is also important to note that c.71% of the listings belong to the entire home/apt category, with a further c.25% being made up of private rooms.
The variable accommodates gives us an idea of how many people can fit into the listed property. We can see from the output below that around 40% of AirBnB listings are made to accommodate 2 people, and three quarters of listings accommodate 4 people or fewer (Q3 = 4). This makes sense, as AirBnB is focused on normal homes being rented out to guests, which means the housing capacity is lower, especially in cities such as Stockholm where most people live in flats rather than houses. The scattering of AirBnBs that accommodate more than 5 people could belong to families that have moved out of Stockholm and are looking for long-term leases of their homes.
#get summary statistics of `accommodates` variable
favstats(~accommodates, data = listings_clean) %>%
kable(caption = "Statistics on how many people the property accommodates",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 4 | 16 | 3.143985 | 1.860288 | 2153 | 0 |
#plot distribution of variable
ggplot(listings_clean, aes(x = accommodates)) +
geom_histogram(fill = "darkgreen")+
labs(title = "Number of listings for different numbers of capacity",
x = "Number of people that can be accommodated",
y = "Number of listings") +
theme_bw() +
theme(legend.position = "none") +
NULL
Since the number of people an apartment/house can fit goes hand in hand with the number of bathrooms and bedrooms provided, we also want to check the correlation among these variables.
Before doing this, we first need to convert bathrooms_text into bathrooms, with the latter being a numeric, and include an additional step to account for the NAs produced with this method.
#convert into numeric
listings_clean <- listings_clean %>%
mutate(bathrooms = parse_number(bathrooms_text))
#check whether transformation worked
typeof(listings_clean$bathrooms)
[1] "double"
#check why there are NA values of bathrooms
listings_clean %>%
filter(is.na(bathrooms)) %>%
count(bathrooms_text) %>%
kable(caption = "For rows that have NA values in bathrooms, their original bathrooms texts have these following values",
align = "r") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T) %>%
column_spec(1, width = "40%")
| bathrooms_text | n |
|---|---|
| Half-bath | 4 |
| Private half-bath | 3 |
| NA | 12 |
We can see that all rows with NA values in the new bathrooms variable have either a half-bath or NA in bathrooms_text. We decide to assign 0.5 to bathrooms if the listing has a half-bathroom, and otherwise leave it as NA.
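The conversion logic described above can be illustrated on a small made-up vector. The sketch below uses only base R and hypothetical example strings; it extracts the leading number where there is one and then fills in 0.5 only where the text explicitly mentions a half-bath. Note that this is slightly stricter than assigning 0.5 to every non-missing text without a number.

```r
# Toy examples of bathrooms_text values (hypothetical data)
bathrooms_text <- c("1 bath", "1.5 shared baths", "Half-bath",
                    "Private half-bath", NA)

# Extract the first number in the string; strings without digits become NA
num <- suppressWarnings(
  as.numeric(sub("^[^0-9]*([0-9]+\\.?[0-9]*).*$", "\\1", bathrooms_text))
)

# Flag texts that mention a half-bath (grepl returns FALSE for NA input)
is_half <- grepl("half", bathrooms_text, ignore.case = TRUE)

# Use 0.5 for half-baths, keep parsed numbers, leave true NAs untouched
bathrooms <- ifelse(is.na(num) & is_half, 0.5, num)

print(bathrooms)
```

Matching on the word "half" rather than on "any text without a number" guards against a future scrape introducing another non-numeric label that is not a half-bath.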
listings_clean <- listings_clean %>%
mutate(bathrooms = ifelse(!is.na(bathrooms_text) & is.na(bathrooms),
0.5, bathrooms))
Now let’s take a look at accommodates, bathrooms, bedrooms and beds.
#Use ggpairs() to show correlation
listings_clean %>%
select(accommodates, bedrooms, bathrooms, beds) %>%
ggpairs()
The correlation coefficients calculated above are as would be expected. The correlations of bathrooms, beds and bedrooms with accommodates are all positive, as each of them relates directly to the number of people a property can accommodate.
The data shows that the numbers of beds and bedrooms are more strongly correlated with the number of people that can be accommodated than with each other. This makes sense, as the number of beds relates directly to the number of people that can stay there, while bathrooms can be shared between multiple people.
It is somewhat surprising that beds and bedrooms are not more closely correlated. Perhaps this is because hosts count sofa beds in their bed count but don’t count the room the sofa bed is in as a bedroom, as it is more likely a living room. Nevertheless, we will keep the correlation in mind for our regression analysis later.
Since we want to look at how much a stay for 2 people would cost, it is natural to filter out those properties that only accommodate 1 person.
listings_clean <- listings_clean %>%
filter(accommodates >= 2)
There are different variables for the availability of the listings (availability_30, availability_60, availability_90, availability_365), based on the time horizon. Let us look at the distribution of each of them and their correlation with each other and the price with the help of ggpairs().
listings_clean %>%
select(availability_30, availability_60, availability_90, availability_365, price) %>%
ggpairs()
The correlations of the availability variables with one another are all positive. Upon closer inspection, the variables with the closest time horizons are the most strongly correlated. For example, the correlation between availability_30 and availability_60 is 0.94, almost a perfect correlation, and the correlation between availability_60 and availability_90 is even higher at 0.97. This makes sense, as properties that are available for the next 30 days should in most cases, barring a booking further in the future, be available for the next 60 or 90 days as well. The correlation coefficients then decrease substantially when calculated with availability_365. This could be because many hosts rent out their AirBnBs temporarily rather than permanently, meaning they don’t take bookings that far in the future as they only plan to rent the property out for a few months.
Because the availability measures are all highly correlated, we will choose only one of them to explain the price later. Since the availability for the next 30 days (availability_30) is the one most highly correlated with the price, we will only keep this variable. However, the correlation is very close to zero, which is why we do not expect availability to be a significant predictor of the price.
While we have multiple variables for the number of reviews, those variables for the number of reviews in the last 12 months (number_of_reviews_ltm) and in the last 30 days (number_of_reviews_l30d) are oftentimes zero. This is why we will first analyse the variable for the total number of reviews (number_of_reviews) further.
#summary statistics of total number of reviews
favstats(~number_of_reviews, data = listings_clean) %>%
kable(caption = "Statistics on number of reviews",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 10 | 33 | 593 | 30.27975 | 55.09263 | 1916 | 0 |
#visualisation of distribution of number of reviews
histo1 <- ggplot(listings_clean, aes(x = number_of_reviews)) +
geom_histogram(fill = "lightblue",
color = "white") +
labs(x = NULL,
y = "Number of listings") +
theme_bw() +
NULL
histo2 <- ggplot(listings_clean, aes(x = number_of_reviews)) +
geom_histogram(binwidth = 5,
fill = "lightblue",
color = "white") +
coord_cartesian(xlim = c(0,50)) +
labs(x = NULL, y = NULL) +
theme_bw() +
NULL
#plot_grid(histo1, histo2)
grid.arrange(histo1, histo2, ncol=2, nrow=1,
top=textGrob("Number of listings according to number of reviews"),
bottom = textGrob("Number of reviews", gp = gpar(cex = 0.9)))
As shown above, the mean and median number of reviews are 30 and 10 respectively. The gap between the two is probably due to a handful of listings with a very large number of reviews (one has 593 reviews and two have around 260), which pull the mean upwards. This is also visible in the histogram: the small spike near the 300 mark can be seen, while the single outlier of 593 cannot.
The histogram also makes clear that most of the AirBnBs have 10 or fewer reviews. This might simply be due to the fact that people do not like giving reviews, or because many new AirBnB listings have recently been added. More people might have started listing on AirBnB as a way of making a living after the financial hardship caused by COVID and the lockdown-heavy environment of the last few years.
Since the number of reviews has a high variability, we create a new variable called reviews_30_plus that takes the value yes if the listing has 30 or more reviews and no otherwise. As a second step, we will look at the price for these two groups to see whether there is a difference.
#create new variable
listings_clean <- listings_clean %>%
mutate(reviews_30_plus = ifelse(number_of_reviews >= 30, "yes", "no"))
#summary statistics for price based on new variable
favstats(price~reviews_30_plus, data = listings_clean) %>%
kable(caption = "Price statistics by whether the listing has at least 30 reviews",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| reviews_30_plus | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| no | 99 | 694.5 | 995 | 1500 | 12015 | 1215.147 | 967.5271 | 1395 | 0 |
| yes | 105 | 620.0 | 870 | 1300 | 8100 | 1107.914 | 829.4636 | 521 | 0 |
#visualise distribution of price based on that variable
ggplot(listings_clean,
aes(x = reviews_30_plus, y = price,
fill=reviews_30_plus)) +
geom_boxplot() +
labs(title = "Distribution of price based on whether listings have at least 30 reviews",
x = "Listing has at least 30 reviews",
y = "Price per night") +
theme_bw() +
theme(legend.position = "none") +
NULL
From the above table and boxplots we cannot really see a stark difference in price between the listings with more and fewer reviews. To get a better sense of this, let us calculate the confidence intervals for the mean price of the two groups and conduct a hypothesis test to check whether the mean price differs between them.
# calculate CIs with formula
formula_ci <- listings_clean %>%
group_by(reviews_30_plus) %>%
summarise(mean_price = mean(price, na.rm = TRUE),
sd_price = sd(price, na.rm = TRUE),
count = n(),
SE = sd_price/sqrt(count),
t_critical = qt(0.975, count - 1),
lower = mean_price - t_critical * SE,
upper = mean_price + t_critical * SE)
formula_ci %>%
kable(caption = "Confidence intervals for mean price by whether the listing has at least 30 reviews",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| reviews_30_plus | mean_price | sd_price | count | SE | t_critical | lower | upper |
|---|---|---|---|---|---|---|---|
| no | 1215.147 | 967.5271 | 1395 | 25.90455 | 1.961667 | 1164.331 | 1265.963 |
| yes | 1107.914 | 829.4636 | 521 | 36.33946 | 1.964536 | 1036.523 | 1179.304 |
#conduct hypothesis test with t.test
t.test(price~reviews_30_plus, data = listings_clean)
Welch Two Sample t-test
data: price by reviews_30_plus
t = 2.4029, df = 1078.8, p-value = 0.01644
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
19.66705 194.79960
sample estimates:
mean in group no mean in group yes
1215.147 1107.914
Even though the two confidence intervals overlap slightly, the hypothesis test shows that the means differ significantly at the 5% level (p = 0.016), though not at the 1% level. Since the 95% confidence level is the conventional threshold in academia, we reject the null hypothesis of equal means. This result means that we can expect listing prices to differ between listings with at least 30 reviews and those with fewer than 30, with the latter being more expensive on average; we will keep this in mind when we conduct our regression analyses.
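To make the link between the summary table and the test output explicit, the sketch below recomputes the Welch statistic by hand from the rounded group means, standard deviations and counts reported above. The variable names are our own, and small differences from the `t.test()` output are possible because the inputs are rounded.

```r
# Group summaries copied from the confidence-interval table above
mean_no  <- 1215.147; sd_no  <- 967.5271; n_no  <- 1395
mean_yes <- 1107.914; sd_yes <- 829.4636; n_yes <- 521

# Squared standard errors of the two group means
v_no  <- sd_no^2  / n_no
v_yes <- sd_yes^2 / n_yes

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean_no - mean_yes) / sqrt(v_no + v_yes)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_no + v_yes)^2 /
  (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df_welch)

print(round(c(t = t_stat, df = df_welch, p = p_value), 4))
```

The results should agree with the reported t of about 2.40 on roughly 1,079 degrees of freedom. The calculation also shows why slightly overlapping confidence intervals do not rule out a significant difference: the test uses the standard error of the difference, which is smaller than the sum of the two individual margins of error.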
In relation to the total number of reviews we are also given an average number of reviews_per_month. Let us look at the distribution of this variable and its correlation to the total number of reviews as well as the price.
#summary statistics of reviews_per_month
favstats(~reviews_per_month, data = listings_clean) %>%
kable(caption = "Statistics on monthly reviews",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.28 | 0.69 | 1.7375 | 21.49 | 1.342861 | 1.797447 | 1646 | 270 |
#distribution of reviews_per_month
ggplot(listings_clean, aes(reviews_per_month)) +
geom_histogram(fill= "lightblue") +
labs(title = "Distribution of number of reviews per month",
x = "Number of reviews per month",
y = NULL) +
theme_bw() +
NULL
#calculate correlation between reviews_per_month, number_of_reviews and price
listings_clean %>%
dplyr::select(reviews_per_month, number_of_reviews, price) %>%
cor(use = "na.or.complete")
                  reviews_per_month number_of_reviews       price
reviews_per_month         1.0000000        0.42944551 -0.11352986
number_of_reviews         0.4294455        1.00000000 -0.06982266
price                    -0.1135299       -0.06982266  1.00000000
The data shows that Stockholm listings do not receive many reviews per month: the median is 0.69 and the mean is 1.34. This is expected, as most people do not make the effort to go back and review a place unless it was exceptionally good or exceptionally bad.
After calculating the correlation coefficients, it is clear that number_of_reviews is positively correlated with reviews_per_month. What is interesting is the correlation between price and both number_of_reviews and reviews_per_month, which is -0.070 and -0.114 respectively. Although both correlations are negative, the coefficients are very low, meaning that price is barely affected by number_of_reviews and reviews_per_month. This could be because there is no way to ascertain whether the reviews are good or bad, so a high number of reviews doesn’t mean that a certain AirBnB is better, which would lead the price to be higher. However, it could also mean that the more information people have about an apartment in terms of reviews, the less leeway hosts have in pricing, thus driving prices down.
We have multiple variables that give information about the rating of each listing, including review_scores_rating, review_scores_accuracy, review_scores_cleanliness, etc. As we expect these reviews to be positively correlated, we will look at their distribution and correlation in more detail.
listings_clean %>%
select(review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_communication,
review_scores_checkin, review_scores_location,
review_scores_value) %>%
ggpairs()
We can see from the density plots of the rating variables that most of the ratings are at the very upper end, i.e. 4.5 or above. From the above scatterplots and correlation coefficients we can see that the different rating scores are all highly correlated. This is why we will most probably use only one rating score in our later regressions, so that we avoid collinearity. In our opinion, it makes most sense to use the variable review_scores_rating, as this seems to be the overall rating of the listing. Let us look at its correlation with the price.
#scatterplot of price and review_scores_rating
ggplot(listings_clean, aes(x = review_scores_rating, y = price)) +
geom_point(color = "purple") +
#use logarithmic scale
scale_y_log10() +
labs(title = "Price and scores review",
y = "Price",
x = "Review Rating") +
theme_bw()+
NULL
#calculate correlation coefficient
listings_clean %>%
select(review_scores_rating, price) %>%
cor(use = "na.or.complete")
                     review_scores_rating       price
review_scores_rating           1.00000000 -0.01283758
price                         -0.01283758  1.00000000
Plotting the price against the review scores does not give a clear view of their relationship. It seems that most of the apartments are rated above 4.5. However, even in the few cases where the rating was 0, the apartment still had as high a price as that of 5/5 rated AirBnBs. We must not forget that these are reviews by users of the internet, meaning that even though most of the time they can be helpful, there will always be people who just want to “have fun” and give 0/5 reviews for no specific reason. Be that as it may, although we would expect apartments with higher ratings to have a higher price, this is not always the case. Therefore, given that this variable is not very informative, we want to create a new variable called top_reviewed that will be yes if review_scores_rating is greater than or equal to 4.8 and no otherwise. We will also check whether there is a significant difference in price between the two groups.
#create new variable top_reviewed
listings_clean <- listings_clean %>%
mutate(top_reviewed = ifelse(review_scores_rating >= 4.8, "yes", "no"))
#check distribution of top_reviewed variable
listings_clean %>%
filter(!is.na(top_reviewed)) %>%
count(top_reviewed) %>%
mutate(percent = round(n/sum(n)*100, 2)) %>%
arrange(desc(percent))%>%
kable(caption = "Summary of whether the listing is top reviewed",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T)
| top_reviewed | n | percent |
|---|---|---|
| yes | 956 | 58.08 |
| no | 690 | 41.92 |
#check whether difference in price and calculate CI
favstats(price~top_reviewed, data = listings_clean) %>%
mutate(SE = sd/sqrt(n),
t_crit = qt(0.975, n-1),
margin_of_error = t_crit * SE,
lower = mean - margin_of_error,
upper = mean + margin_of_error) %>%
kable(caption = "Statistics on whether the listing is top reviewed",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| top_reviewed | min | Q1 | median | Q3 | max | mean | sd | n | missing | SE | t_crit | margin_of_error | lower | upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no | 105 | 650 | 900 | 1275 | 12000 | 1080.796 | 842.0805 | 690 | 0 | 32.05746 | 1.963413 | 62.94203 | 1017.854 | 1143.738 |
| yes | 99 | 695 | 1000 | 1500 | 10000 | 1237.601 | 897.3270 | 956 | 0 | 29.02163 | 1.962451 | 56.95353 | 1180.648 | 1294.555 |
#conduct hypothesis test to see whether difference in price between the 2 groups
t.test(price~top_reviewed, data = listings_clean)
Welch Two Sample t-test
data: price by top_reviewed
t = -3.6262, df = 1536.5, p-value = 0.000297
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
-241.62686 -71.98477
sample estimates:
mean in group no mean in group yes
1080.796 1237.601
We now have a clearer picture of the data. Although we could not infer much from the entire distribution of the rating scores, dividing the listings into the two categories above shows that apartments rated 4.8 or higher have higher prices on average. Both categories are right skewed, since the mean is larger than the median, and both have large standard deviations, meaning that prices are spread out. This makes sense, since we saw before that prices vary a lot for a given rating, with both groups having many extreme values pushing the mean higher. We also tested whether the difference in price between listings that are top reviewed and those that are not is statistically significant. The 95% confidence intervals of the mean prices of the two groups do not overlap, which already points to a real difference. This is confirmed by the t-test, whose p-value of 0.000297 is below 0.05, so we reject the null hypothesis that the difference is zero. A rating of 4.8 or above is therefore associated with a higher price, and we will use this variable when we build our model later.
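To make the link between the summary table and the t.test() output explicit, the sketch below recomputes the Welch statistic by hand from the means, standard deviations and group sizes reported above. The variable names are ours and only serve the illustration; up to rounding, the result should agree with the t statistic and degrees of freedom printed by t.test().

```r
# Summary statistics copied from the favstats() table above
mean_no  <- 1080.796; sd_no  <- 842.0805; n_no  <- 690
mean_yes <- 1237.601; sd_yes <- 897.3270; n_yes <- 956

# Squared standard error of each group mean
v_no  <- sd_no^2 / n_no
v_yes <- sd_yes^2 / n_yes

# Welch t statistic: difference in means over its standard error
t_stat <- (mean_no - mean_yes) / sqrt(v_no + v_yes)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_no + v_yes)^2 /
  (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

Because the Welch test does not assume equal variances, the degrees of freedom are not simply the total sample size minus two.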
One logical variable that might influence the price is instant_bookable that indicates whether the listing can be booked immediately or not. Let us first take a look at the distribution of listings according to this variable and afterwards investigate its relationship with the price.
#summary statistics of price based on instant_bookable variable
favstats(price~instant_bookable, data = listings_clean) %>%
kable(caption = "Statistics on whether the listing can be booked immediately",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| instant_bookable | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| FALSE | 99 | 699.0 | 1000 | 1500 | 10000 | 1237.172 | 917.0884 | 1405 | 0 |
| TRUE | 155 | 631.5 | 816 | 1200 | 12015 | 1045.256 | 962.4574 | 511 | 0 |
#plot distribution of instant_bookable variable
ggplot(listings_clean,
aes(y = factor(instant_bookable,
levels = c(TRUE, FALSE)),
fill = instant_bookable)) +
geom_bar() +
labs(title = "Distribution of listings based on instant bookable option",
x = "Number of listings",
y = "Instant bookable") +
theme_bw() +
theme(legend.position = "none") +
NULL
#plot distribution of price based on instant_bookable
ggplot(listings_clean,
aes(x = instant_bookable, y = price,
fill = instant_bookable)) +
#alpha is a fixed setting, so it belongs in the geom rather than in aes()
geom_boxplot(alpha = 0.5) +
labs(title = "Distribution of prices according to instantly bookable option",
x = "Instant bookable",
y = "Price per night") +
theme_bw() +
theme(legend.position = "none") +
NULL
Almost 3/4 of the properties are not instantly bookable. The boxplots of the two groups are both right skewed. Interestingly, the instantly bookable group has more outliers at the higher end than the group that is not instantly bookable. The output also shows that listings that do not accept instant bookings have a considerably higher mean and median, as well as a lower standard deviation, although this could be a result of the big difference in the number of observations. One would assume that most instantly bookable AirBnBs are cheaper because the hosts are not vetting the guests. The higher mean and median are likely a function of the reasons outlined above, i.e. higher prices drawing wealthier guests who tend not to need vetting.
There are three different variables, namely neighbourhood, neighbourhood_cleansed and neighbourhood_group_cleansed, that tell us about the neighbourhood each listing is located in. Let us look at them in more detail first.
listings_clean %>%
select(neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed) %>%
skim() %>%
kable(caption = "Brief summary on neighborhood-related variables",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | logical.mean | logical.count |
|---|---|---|---|---|---|---|---|---|---|---|
| character | neighbourhood | 864 | 0.5490605 | 17 | 45 | 0 | 65 | 0 | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.0000000 | 6 | 22 | 0 | 14 | 0 | NA | NA |
| logical | neighbourhood_group_cleansed | 1916 | 0.0000000 | NA | NA | NA | NA | NA | NaN | : |
Based on the output above, the most informative variable is neighbourhood_cleansed, as it does not contain any missing values. Of the other two variables, neighbourhood has 864 NAs and neighbourhood_group_cleansed consists only of NAs. The character.n_unique column of the output shows that there are 14 different neighbourhoods.
#check different neighborhoods given
unique(listings_clean$neighbourhood_cleansed) %>%
kable(caption = "Neighbourhoods of AirBnB listings in Stockholm",
align = "c") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T) %>%
column_spec(1, width = "40%")
| x |
|---|
| Södermalms |
| Skarpnäcks |
| Norrmalms |
| Farsta |
| Rinkeby-Tensta |
| Älvsjö |
| Enskede-Årsta-Vantörs |
| Bromma |
| Kungsholmens |
| Östermalms |
| Hägersten-Liljeholmens |
| Skärholmens |
| Hässelby-Vällingby |
| Spånga-Tensta |
Fourteen categories are a little too many, which is why we group this variable into a smaller number of categories. We looked at the neighbourhoods included and identified 6 categories, which we subsequently reduced to 5 as one did not fit any of the geographic attributes of the neighbourhoods observed. We select the categories using a simple geographic check, so the consolidated groupings are:
#create vectors with regions included in bigger categories
Central <- c("Kungsholmens", "Norrmalms", "Östermalms", "Södermalms")
North <- c("Rinkeby-Tensta", "Hässelby-Vällingby", "Spånga-Tensta")
South <- c("Enskede-Årsta-Vantörs", "Farsta", "Hägersten-Liljeholmens")
West <- c("Skärholmens", "Bromma")
VerySouth <- c("Älvsjö", "Skarpnäcks")
#create new variable city_area
listings_clean <- listings_clean %>%
mutate(city_area = ifelse(neighbourhood_cleansed %in% Central, "Central",
ifelse(neighbourhood_cleansed %in% North, "North",
ifelse(neighbourhood_cleansed %in% South, "South",
ifelse(neighbourhood_cleansed %in% West, "West",
"VerySouth")))))
#check new variable
listings_clean %>%
count(city_area) %>%
mutate(percent = round(n/sum(n)*100, 2)) %>%
arrange(desc(percent)) %>%
kable(caption = "Summary of which region the property is located",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T)
| city_area | n | percent |
|---|---|---|
| Central | 1135 | 59.24 |
| South | 367 | 19.15 |
| North | 145 | 7.57 |
| VerySouth | 145 | 7.57 |
| West | 124 | 6.47 |
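One property of the nested ifelse() used above is that any neighbourhood not listed in the first four vectors ends up in VerySouth, including a misspelt or newly added district. A possible alternative, sketched below in base R, is a named lookup vector that returns NA for anything it does not know. The names groups, area_lookup and toy, as well as the value "Not-a-district", are hypothetical and only used for this illustration.

```r
# Same groupings as above, collected in one named list
groups <- list(
  Central   = c("Kungsholmens", "Norrmalms", "Östermalms", "Södermalms"),
  North     = c("Rinkeby-Tensta", "Hässelby-Vällingby", "Spånga-Tensta"),
  South     = c("Enskede-Årsta-Vantörs", "Farsta", "Hägersten-Liljeholmens"),
  West      = c("Skärholmens", "Bromma"),
  VerySouth = c("Älvsjö", "Skarpnäcks")
)

# Flatten into a named character vector: neighbourhood -> city area
area_lookup <- setNames(rep(names(groups), lengths(groups)), unlist(groups))

# Toy input with one unknown value to show the fallback behaviour
toy <- c("Bromma", "Farsta", "Älvsjö", "Not-a-district")
city_area <- unname(area_lookup[toy])
print(city_area)
```

With this approach, a quick sum(is.na(city_area)) reveals whether any neighbourhood slipped through the grouping.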
We can see that most of the listings are located in the central region, followed by the South. We can now continue to investigate whether there is a price difference.
#summary statistics of price based on city area
favstats(price ~city_area, data = listings_clean) %>%
kable(caption = "Statistics on price of property in different regions",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| city_area | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Central | 99 | 790 | 1050 | 1500.0 | 12000 | 1319.9648 | 1000.5636 | 1135 | 0 |
| North | 200 | 450 | 650 | 899.0 | 12015 | 846.2138 | 1043.3315 | 145 | 0 |
| South | 234 | 500 | 800 | 1199.5 | 5000 | 1012.9210 | 713.2379 | 367 | 0 |
| VerySouth | 175 | 510 | 918 | 1200.0 | 4000 | 1044.1310 | 692.1089 | 145 | 0 |
| West | 237 | 600 | 800 | 1325.0 | 5000 | 1035.0887 | 732.6426 | 124 | 0 |
#plot distribution of price for city areas
ggplot(listings_clean,
aes(x = city_area, y = price,
fill = city_area)) +
geom_boxplot() +
#facet_wrap(~city_area, scales = "free") +
labs(
title = "Distribution of price by area",
x = "City area",
y = "Price",
fill = NULL)+
theme(legend.position = "none") +
NULL
By re-categorising the neighbourhoods into 5 city areas, we can plot the prices to see if there is any relationship with location. As is the case with most cities, centrally located apartments tend to have the highest prices. Indeed, in Stockholm we find that AirBnBs in Central areas have higher prices on average, with many apartments costing up to 5,000 SEK. This is most likely related to the fact that most of the attractions tourists want to visit are in the center of Stockholm. It therefore makes sense that most AirBnBs are in that area, since they are most profitable for owners. Another interesting finding is that, outside the center, AirBnB prices tend to be higher on average the further South the apartment is located, with apartments in the West also having high prices, although we have only a small sample of them.
The variable amenities lists all the extra features each apartment/house offers its guests, such as WiFi, kitchen, hot water and heating. Guests generally want these features, so we believe the availability of amenities could be an important factor in listing prices. Since the amenities variable is currently given as a string list and the range of items is quite extensive, we convert it into a quantifiable variable by counting how many amenities a listing has.
#count the number of elements in split amenities list
listings_clean <- listings_clean %>%
mutate(number_of_amenities = lengths(strsplit(amenities,",")))
#summarise the raw amenities strings
listings_clean %>%
select(amenities) %>%
skim() %>%
kable(caption = "Characteristics of amenities list",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace |
|---|---|---|---|---|---|---|---|---|
| character | amenities | 0 | 1 | 14 | 1705 | 0 | 1838 | 0 |
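Counting comma-separated pieces is a simple proxy, and it is worth seeing where it can drift from the true number of items. The sketch below assumes that each amenities string is a bracketed list of double-quoted items; that format and the three example strings are assumptions made for illustration, not values taken from our dataset. It compares the split-based count with a count of the quoted items themselves.

```r
# Hypothetical amenities strings in the assumed format
amenities <- c(
  '["Wifi", "Kitchen", "Hot water"]',
  '["TV with Netflix, HBO", "Wifi"]',
  '[]'
)

# Approach used above: count the pieces between commas
count_split <- lengths(strsplit(amenities, ","))

# Alternative: count the double-quoted items themselves
count_quoted <- lengths(regmatches(amenities, gregexpr('"[^"]*"', amenities)))

print(data.frame(count_split, count_quoted))
```

The two counts differ only when an item contains a comma or when the list is empty, so for a rough measure such as ours the split-based count is likely good enough. The quoted-item pattern would itself need adjusting if items could contain escaped quotes.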
Let’s take a look at how number_of_amenities is distributed across all listings and also calculate its correlation coefficient with the price in order to get an idea of its relevance.
#summary statistics of number_of_amenities
favstats(~number_of_amenities, data = listings_clean) %>%
kable(caption = "Statistics on number of amenities",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 15 | 23 | 32 | 88 | 24.81002 | 12.19067 | 1916 | 0 |
#distribution of number_of_amenities
ggplot(listings_clean, aes(number_of_amenities)) +
geom_histogram(fill = "purple",
color = "black")+
labs(
title = "Distribution of number of amenities",
x = "Number of amenities",
y = NULL)+
theme_bw()+
NULL
#calculate correlation between number_of_amenities and price
listings_clean %>%
select(number_of_amenities, price) %>%
cor(use = "na.or.complete")
                    number_of_amenities      price
number_of_amenities          1.00000000 0.06996548
price                        0.06996548 1.00000000
Contrary to what we expected, number_of_amenities seems to have little correlation with listing prices. We think this might be because the amenities column in the original dataset is not informative enough: some hosts may not bother to put all the amenities they have on the listing page, while others might exaggerate by including insignificant items such as shampoo. We saw this in the summary statistics above, where some houses and apartments have only single-digit amenities and others have more than 50. This might also explain the high variability of the number of amenities shown in the above histogram.
Nonetheless, we might still include number_of_amenities in one of our models, as it could intuitively influence demand for each listing.
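A weak correlation is not necessarily indistinguishable from zero when the sample is large. As a rough check, the sketch below applies the standard t-test for a Pearson correlation to the coefficient reported above. It assumes that all 1916 listings enter the calculation, which matches the zero missing values shown for number_of_amenities but is still an assumption on our part.

```r
r <- 0.06996548   # correlation between number_of_amenities and price
n <- 1916         # assumed number of complete pairs

# t statistic for H0: the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p_value))
```

Under these assumptions the statistic comes out at roughly 3.07, above the usual critical value of about 1.96, so the correlation is small in size but unlikely to be exactly zero. On the full data, cor.test() would perform the same test directly.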
As a last step, we also want to look at the different characteristics of the people offering their apartments and houses. We identified the variables host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost and host_listings_count to be of potential interest. Let us first have a more detailed look at these variables.
listings_clean %>%
select(host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_listings_count) %>%
skim() %>%
kable(caption = "Brief summary on host-related variables",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | host_response_time | 0 | 1 | 3 | 18 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_rate | 0 | 1 | 2 | 4 | 0 | 42 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_acceptance_rate | 0 | 1 | 2 | 4 | 0 | 83 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 0 | 1 | NA | NA | NA | NA | NA | 0.2077244 | FAL: 1518, TRU: 398 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | host_listings_count | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | 5.466597 | 38.88703 | 0 | 1 | 1 | 2 | 1106 | ▇▁▁▁▁ |
As we can see from the output above, the two variables host_response_rate and host_acceptance_rate that should be numeric variables are given as character variables. We will therefore change them before we continue with our analysis.
#change format of variables to numeric
listings_clean <- listings_clean %>%
mutate(host_response_rate = ifelse(host_response_rate != "N/A",
parse_number(host_response_rate),
NA),
host_acceptance_rate = ifelse(host_acceptance_rate != "N/A",
parse_number(host_acceptance_rate),
NA))
typeof(listings_clean$host_response_rate)
[1] "double"
typeof(listings_clean$host_acceptance_rate)
[1] "double"
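For readers who want to see the conversion in isolation, the sketch below reproduces the idea in base R on a handful of made-up values. It assumes that the rates are stored as strings such as "87%" with "N/A" marking missing entries; the example values and the names rates and rates_num are ours.

```r
# Made-up values in the assumed "NN%" format, with "N/A" for missing
rates <- c("100%", "87%", "N/A", "0%")

# Mark "N/A" as missing first so the conversion raises no warning
rates[rates == "N/A"] <- NA

# Strip the percent sign and convert to numeric
rates_num <- as.numeric(sub("%", "", rates, fixed = TRUE))
print(rates_num)
```

Replacing "N/A" before converting keeps the step free of coercion warnings, which makes genuine parsing problems easier to spot.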
One variable that stands out to us is the logical variable host_is_superhost. According to AirBnB’s website, superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests. Therefore, we would expect listings of superhosts to be more expensive than those of non-superhosts. Let us first check how many listings of superhosts we have in our dataset and then see whether there is a difference in the price of these listings.
# count number of superhosts
listings_clean %>%
count(host_is_superhost) %>%
mutate(percent = round(n/sum(n) * 100, 2)) %>%
arrange(desc(percent)) %>%
kable(caption = "Summary of whether host is superhost",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T)
| host_is_superhost | n | percent |
|---|---|---|
| FALSE | 1518 | 79.23 |
| TRUE | 398 | 20.77 |
favstats(price ~host_is_superhost, data = listings_clean) %>%
kable(caption = "Statistics on whether host is superhost",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| host_is_superhost | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| FALSE | 105 | 688.75 | 981.0 | 1500 | 12015 | 1206.740 | 915.1294 | 1518 | 0 |
| TRUE | 99 | 613.75 | 881.5 | 1200 | 12000 | 1106.837 | 995.5642 | 398 | 0 |
#plot price distribution based on superhosts
listings_clean %>%
filter(!is.na(host_is_superhost)) %>%
ggplot(aes(x = host_is_superhost, y = price,
fill = host_is_superhost)) +
#transparency is set with alpha on the geom
geom_boxplot(alpha = 0.4) +
theme_bw() +
labs(title = "Boxplot of price for superhosts and non-superhosts",
x = "Host is superhost",
y = "Price") +
NULL
Surprisingly, apartments that do not have superhosts seem to have higher prices on average. In fact, both the mean and the median price of listings without a superhost are higher than those of listings with one. Even the single most expensive AirBnB does not have a superhost. Superhost status and price therefore seem to be inversely related.
One thing we need to note, though, is that only 21% of the apartments have a superhost, a fairly small sample. Perhaps some hosts have only just begun superhosting in the hope of getting a higher price for their flat. Conversely, hosts that are not superhosts might already attract high bids, so they do not need superhost status to keep the price high. We will proceed by calculating confidence intervals of the mean price for superhosts and non-superhosts. Our null hypothesis is that there is no difference in price between the two groups.
#calculate CIs for price of superhosts and non-superhosts
listings_clean_CI <- listings_clean %>%
drop_na(host_is_superhost) %>%
group_by(host_is_superhost) %>%
summarise(
mean_price = mean(price),
n = n(),
SE = sd(price)/sqrt(n),
t_critical = qt(0.975, (n-1)),
lower = mean_price - t_critical * SE,
upper = mean_price + t_critical * SE
)
listings_clean_CI %>%
kable(caption = "Statistics on whether host is superhost",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| host_is_superhost | mean_price | n | SE | t_critical | lower | upper |
|---|---|---|---|---|---|---|
| FALSE | 1206.740 | 1518 | 23.48803 | 1.961529 | 1160.668 | 1252.813 |
| TRUE | 1106.837 | 398 | 49.90313 | 1.965957 | 1008.729 | 1204.944 |
We can tell that the CIs overlap, so we will test whether the difference in price between the flats that have a superhost and those that do not is significant.
t.test(price ~ host_is_superhost, data = listings_clean)
Welch Two Sample t-test
data: price by host_is_superhost
t = 1.8113, df = 584.87, p-value = 0.0706
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-8.42107 208.22860
sample estimates:
mean in group FALSE mean in group TRUE
1206.740 1106.837
Since the p-value is larger than 0.05 we cannot reject the null hypothesis. Therefore, the difference in price between these two groups is not statistically significant and the variable host_is_superhost will most likely not be included in the final model.
As a next step, let us look at the numeric variables we have for the hosts and check whether there is a certain correlation among them and price. We do this again with the help of the ggpairs() function.
listings_clean %>%
select(host_response_rate, host_acceptance_rate, host_listings_count, price) %>%
ggpairs()
From the above output we can see that, while the acceptance rate and the response rate of hosts are correlated with each other, the correlations of the three variables with the price are all extremely close to 0. This is why we will not look at them in more detail, and also because we do not want to overfit our model later with too many variables.
We still want to have a look at the categorical variable host_response_time and check in a density plot whether there is a difference in price among the different response-time categories.
#bring all NA values into consistent format
listings_clean <- listings_clean %>%
mutate(host_response_time = ifelse(host_response_time == "N/A", NA, host_response_time))
#count different response times
listings_clean %>%
filter(!is.na(host_response_time)) %>%
count(host_response_time) %>%
mutate(percent = round(n/sum(n)*100, 2)) %>%
arrange(desc(percent)) %>%
kable(caption = "Summary of host response time",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
full_width = F) %>%
row_spec(0, bold = T)
| host_response_time | n | percent |
|---|---|---|
| within an hour | 739 | 55.15 |
| within a day | 259 | 19.33 |
| within a few hours | 224 | 16.72 |
| a few days or more | 118 | 8.81 |
We can see that more than half of the hosts respond within an hour. We are curious whether the price of listings with fast-responding hosts is different, so we will plot the distribution of price for the four identified categories and calculate the confidence intervals of the mean price per category.
#Plot distribution of price based on response time of host
listings_clean %>%
filter(!is.na(host_response_time)) %>%
ggplot(aes(x = price, fill = host_response_time)) +
geom_density() +
facet_wrap(~host_response_time) +
theme_bw() +
labs(title = "Distribution of price based on response time of host",
x = "Price",
y = "Density",
fill = "Host response time") +
NULL
#look at distribution of price across response times and calculate CIs
favstats(price ~ host_response_time, data = listings_clean) %>%
mutate(SE = sd/sqrt(n),
t_crit = qt(0.975, n-1),
margin_of_error = t_crit * SE,
lower = mean - margin_of_error,
upper = mean + margin_of_error) %>%
kable(caption = "Statistics on price and host response time",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| host_response_time | min | Q1 | median | Q3 | max | mean | sd | n | missing | SE | t_crit | margin_of_error | lower | upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a few days or more | 250 | 658.5 | 999.5 | 1476.50 | 5400 | 1128.619 | 764.9181 | 118 | 0 | 70.41642 | 1.980448 | 139.45603 | 989.1626 | 1268.075 |
| within a day | 167 | 790.0 | 1100.0 | 1517.00 | 10000 | 1310.560 | 958.5149 | 259 | 0 | 59.55922 | 1.969201 | 117.28409 | 1193.2758 | 1427.844 |
| within a few hours | 237 | 654.5 | 914.5 | 1502.25 | 10000 | 1205.286 | 945.5702 | 224 | 0 | 63.17857 | 1.970659 | 124.50341 | 1080.7823 | 1329.789 |
| within an hour | 99 | 650.0 | 938.0 | 1317.00 | 12000 | 1143.510 | 917.3212 | 739 | 0 | 33.74421 | 1.963184 | 66.24608 | 1077.2641 | 1209.756 |
All four categories seem to have similar prices, with the means ranging from 1129 to 1311 SEK. Although hosts who take longer to reply (up to within a day) charge somewhat more on average, all CIs overlap. Thus we cannot reject the null hypothesis that the difference in price between the four groups is on average zero. Therefore, we do not expect this variable to be useful in our regression analysis.
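Comparing confidence intervals by eye is an informal check. A single formal test across several groups could look like the sketch below, which runs a Welch one-way ANOVA with oneway.test() from base R. The data are simulated with the same price distribution in every group, so the output illustrates the mechanics only and says nothing about our listings; the data frame toy and its values are hypothetical.

```r
set.seed(42)  # simulated data, not our listings

levels_rt <- c("within an hour", "within a few hours",
               "within a day", "a few days or more")
toy <- data.frame(
  host_response_time = factor(rep(levels_rt, each = 50), levels = levels_rt),
  # Log-normal prices with the same distribution in every group
  price = rlnorm(200, meanlog = 7, sdlog = 0.6)
)

# Welch one-way ANOVA (var.equal = FALSE is the default)
res <- oneway.test(price ~ host_response_time, data = toy)
print(res)
```

Applied to listings_clean with the same formula, a small p-value would indicate that at least one response-time group differs in mean price.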
Now that we have investigated all the features that might be of interest for predicting the price of a 4-night stay for 2 people in Stockholm, we can reduce the number of features we will include in our regression models. The variables we consider as potential explanatory variables for price of an AirBnB in Stockholm are:
prop_type_simplified
accommodates
room_type, bathrooms, bedrooms, beds, number_of_amenities
availability_30
number_of_reviews, reviews_per_month, review_scores_rating, reviews_30_plus, top_reviewed
instant_bookable
city_area
host_is_superhost, host_response_time (both rather unlikely to have an effect)
Before we continue with building our model, we will first look at the correlation among all the numeric variables we have. We do this with the help of ggpairs().
#select variables to include in ggpairs()
listings_clean %>%
dplyr::select(price, accommodates, bathrooms, bedrooms, beds,
number_of_amenities, availability_30, number_of_reviews,
reviews_per_month, review_scores_rating) %>%
ggpairs()
From the above output we can see that the majority of variables do not have a very strong linear relationship with the price. The variable with the highest correlation with the price is the number of bedrooms (0.412), followed by the number of people each apartment/house accommodates (0.383) and the number of beds (0.299). However, these variables are also highly correlated with each other (correlation coefficients of 0.760 for accommodates and bedrooms and 0.739 for bedrooms and beds), so we will need to be careful when we include them in our model, as we want to avoid collinearity. While the correlation coefficients between the price and some of the other variables are not very strong, these variables might still turn out to be valuable predictors in our model later, as a weak correlation can still be statistically significant. We also see from the scatterplots that many of the variables are not correlated with each other at all (e.g., beds and reviews_per_month), which is a good thing for the same reason. Let us have a look at whether we can find a stronger linear relationship between the price and other variables by accounting for the categorical variable city_area that contains information about the location of each listing.
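To get a feeling for how serious the collinearity might be, one option is the variance inflation factor (VIF). This diagnostic is brought in here purely for illustration and is not part of the analysis above. The sketch uses the simplified two-predictor case, where the VIF depends only on the pairwise correlation; the helper vif_two() is a hypothetical name.

```r
# Pairwise correlations read off the ggpairs() output above
r_accommodates_bedrooms <- 0.760
r_bedrooms_beds         <- 0.739

# With only two predictors, VIF = 1 / (1 - r^2)
vif_two <- function(r) 1 / (1 - r^2)

print(round(c(accommodates_bedrooms = vif_two(r_accommodates_bedrooms),
              bedrooms_beds         = vif_two(r_bedrooms_beds)), 2))
```

In this simplified setting both values land between 2 and 2.5. When accommodates, bedrooms and beds all enter the same model, the actual VIFs can be higher, so they would need to be checked on the fitted regression.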
ggplot(listings_clean, aes(x = accommodates, y = price)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
facet_wrap(~city_area) +
theme_bw() +
labs(title = "Relationship between people capacity of listing and price split by geographic area",
x = "People capacity of listing",
y = "Price") +
NULL
From the above we can see that the geographic area of the listing does not seem to be a condition for the positive relationship between the capacity of the listing (accommodates variable) and its price. We tested such conditional relationships not only for the accommodates variable but also for all other numeric variables, and we also varied the categorical variable used as the condition. We were not able to find any outstanding conditional relationship. Only for the type of property does it seem that Entire rental units and Other show a stronger positive relationship between capacity and price than the other types, as can be seen below.
ggplot(listings_clean, aes(x = accommodates, y = price)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
facet_wrap(~prop_type_simplified) +
theme_bw() +
labs(title = "Relationship between people capacity of listing and price split by property type",
x = "People capacity of listing",
y = "Price") +
NULL
Before we start building our model, let us include an interactive map of Stockholm showing the AirBnB listings where minimum_nights is less than or equal to four (excluding the very high outliers and the listings with a price of 0 SEK that we removed above). This will allow us to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps.
#set continuous color palette based on price
pal <- colorNumeric(
palette = "YlOrRd",
domain = listings_clean$price)
#create map
leaflet(data = listings_clean,
width = "100%",
height = '300px') %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.5,
#set an informative popup interface
popup = paste("ID:", listings_clean$id, "<br>",
"Room type:", listings_clean$room_type, "<br>",
"Accommodates:", listings_clean$accommodates, "people", "<br>",
"Price (in SEK):", listings_clean$price, "<br>",
"URL:", listings_clean$listing_url),
label = ~name) %>%
#add legend to show how colors differ with prices
addLegend(position = "bottomright",
pal = pal,
values = ~price,
title = "Price")
Now that we have identified our features and have a good understanding of the relationships among them and of the geographical layout of the listings in Stockholm, let us start building our model. The aim is to predict the price of a 4-night stay for 2 people in Stockholm. In our analysis, we include all apartments/houses/rooms that have capacity for more than 1 person, meaning that houses that fit 10 people will also be considered in order to predict the price.
Since we do not need all the variables that are contained in the dataset listings_clean, we will first condense our dataset so that it only includes the variables that we might use as explanatory variables. We call this dataset listings_condensed. In addition, since we want to predict the price for two people traveling and staying together in one apartment/house, we will remove all listings that only fit one person from the dataframe. We also create a new variable called price_4_nights that holds the price of the apartment/house for four nights and will be the dependent variable in our regression analysis.
#condense the dataset
listings_condensed <- listings_clean %>%
select(prop_type_simplified, accommodates, room_type, bathrooms, bedrooms, beds, number_of_amenities,
availability_30, number_of_reviews, reviews_per_month, review_scores_rating, reviews_30_plus,
top_reviewed, instant_bookable, city_area, host_is_superhost, price, host_response_time) %>%
#remove those listings that only accommodate one person
filter(accommodates >= 2) %>%
#drop all rows with any NA values
#drop_na() %>%
#calculate price_4_nights
mutate(price_4_nights = price*4)
#take look at condensed dataset
skim(listings_condensed) %>%
kable(caption = "Brief summary on all variables in our condensed dataset",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
fixed_thead = T) %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%", height = "400px")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | factor.ordered | factor.n_unique | factor.top_counts | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | room_type | 0 | 1.0000000 | 10 | 15 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | reviews_30_plus | 0 | 1.0000000 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | top_reviewed | 270 | 0.8590814 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | city_area | 0 | 1.0000000 | 4 | 9 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_time | 576 | 0.6993737 | 12 | 18 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | prop_type_simplified | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 5 | Ent: 1128, Oth: 324, Pri: 243, Ent: 111 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | 0.2667015 | FAL: 1405, TRU: 511 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | 0.2077244 | FAL: 1518, TRU: 398 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | accommodates | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.409186 | 1.8026957 | 2.00 | 2.00 | 3.00 | 4.0000 | 16.00 | ▇▂▁▁▁ |
| numeric | bathrooms | 7 | 0.9963466 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.304872 | 0.7102738 | 0.00 | 1.00 | 1.00 | 1.5000 | 9.00 | ▇▁▁▁▁ |
| numeric | bedrooms | 207 | 0.8919624 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.613224 | 1.0478430 | 1.00 | 1.00 | 1.00 | 2.0000 | 15.00 | ▇▁▁▁▁ |
| numeric | beds | 18 | 0.9906054 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.099578 | 1.7479832 | 0.00 | 1.00 | 2.00 | 3.0000 | 30.00 | ▇▁▁▁▁ |
| numeric | number_of_amenities | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 24.810021 | 12.1906707 | 1.00 | 15.00 | 23.00 | 32.0000 | 88.00 | ▆▇▃▁▁ |
| numeric | availability_30 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.433194 | 10.8929385 | 0.00 | 0.00 | 1.00 | 16.0000 | 30.00 | ▇▁▁▁▂ |
| numeric | number_of_reviews | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 30.279750 | 55.0926305 | 0.00 | 2.00 | 10.00 | 33.0000 | 593.00 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 270 | 0.8590814 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.342861 | 1.7974469 | 0.01 | 0.28 | 0.69 | 1.7375 | 21.49 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 270 | 0.8590814 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.697066 | 0.6426358 | 0.00 | 4.67 | 4.84 | 5.0000 | 5.00 | ▁▁▁▁▇ |
| numeric | price | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1185.987996 | 933.0213066 | 99.00 | 654.75 | 950.00 | 1490.0000 | 12015.00 | ▇▁▁▁▁ |
| numeric | price_4_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4743.951983 | 3732.0852265 | 396.00 | 2619.00 | 3800.00 | 5960.0000 | 48060.00 | ▇▁▁▁▁ |
Let us have a look at the distribution of the price_4_nights variable first.
#summary statistics for price_4_nights
favstats(~price_4_nights, data = listings_condensed) %>%
kable(caption = "Statistics on target cost",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 396 | 2619 | 3800 | 5960 | 48060 | 4743.952 | 3732.085 | 1916 | 0 |
#density plot for price_4_nights
ggplot(listings_condensed, aes(x = price_4_nights)) +
geom_density(fill="pink")+
labs(x = "Price for four nights",
y = "Density",
title = "Distribution of price for a four-night stay") +
theme_bw()+
NULL
Despite some very expensive outliers, the graph above shows that most AirBnB listings in Stockholm cost less than SEK 5,000 for a four-night stay. This is consistent with the median price of 3,800 SEK and the mean price of 4,744 SEK. Because the distribution is strongly right-skewed, we believe it will provide more insight if we transform price_4_nights to a logarithmic scale.
#transform price_4_nights to log scale
listings_condensed <- listings_condensed %>%
mutate(log_price_4_nights = log(price_4_nights))
#calculate summary statistics for new variable
favstats(~log_price_4_nights, data = listings_condensed) %>%
kable(caption = "Statistics on logged target cost",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 5.981414 | 7.870548 | 8.242756 | 8.692826 | 10.78021 | 8.264004 | 0.6180196 | 1916 | 0 |
#density plot for log_price_4_nights
ggplot(listings_condensed, aes(x = log_price_4_nights)) +
geom_density(fill="lightblue")+
labs(x = "Logarithmic price for four nights",
y = "Density",
title = "Distribution of logarithmic price for a four-night stay") +
theme_bw()+
NULL
We can observe from the above graph that the distribution of log_price_4_nights is close to a normal distribution, with a mean of 8.26 and a median of 8.24, implying a cost of \(e^{8.26} \approx\) 3,866 SEK for two people staying for 4 nights. Because log_price_4_nights is approximately normally distributed, it has more favourable statistical properties than price_4_nights, and its regression coefficients can be read as approximate percentage changes in price. Therefore, we will use log_price_4_nights as the dependent variable in our regression models.
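As a quick check on this back-transformation, the sketch below converts the log-scale summary statistics back to SEK. The two input values are copied from the favstats() table above, and the variable names are our own. Note that exponentiating the mean of the logs gives the geometric mean of price_4_nights, which is why the result sits much closer to the median (3,800 SEK) than to the arithmetic mean (4,744 SEK).

```r
# Summary statistics of log_price_4_nights, copied from the table above
mean_log   <- 8.264004
median_log <- 8.242756

# Back-transform to SEK
geo_mean_price <- exp(mean_log)    # geometric mean of price_4_nights
median_price   <- exp(median_log)  # matches the median of price_4_nights

round(c(geometric_mean = geo_mean_price, median = median_price))
```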
Our first regression model will have the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating. We believe these three variables, which describe the type of the property, how many reviews the listing has received and its average rating, are what most guests would check when deciding on the accommodation for their stay.
#create a regression model
model1 <- lm(log_price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = listings_condensed)
#display the result of the model
model1 %>%
tidy() %>%
kable(caption = "Model 1 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.3397381 | 0.0991186 | 84.1389972 | 0.0000000 |
| prop_type_simplifiedPrivate room in rental unit | -0.7059753 | 0.0418737 | -16.8596404 | 0.0000000 |
| prop_type_simplifiedEntire residential home | 0.4424801 | 0.0589185 | 7.5100336 | 0.0000000 |
| prop_type_simplifiedEntire condominium (condo) | 0.0439675 | 0.0627725 | 0.7004266 | 0.4837604 |
| prop_type_simplifiedOther | -0.2535255 | 0.0378812 | -6.6926533 | 0.0000000 |
| number_of_reviews | -0.0004236 | 0.0002328 | -1.8198721 | 0.0689608 |
| review_scores_rating | 0.0075617 | 0.0208553 | 0.3625772 | 0.7169675 |
model1 %>%
glance() %>%
kable(caption = "Model 1 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2026959 | 0.1997771 | 0.5422036 | 69.44622 | 0 | 6 | -1324.526 | 2665.052 | 2708.301 | 481.841 | 1639 | 1646 |
Looking at our model, we can see that not all of the terms are significant: in this initial model, three of them are insignificant. The Entire condominium category has an insignificant coefficient, evidenced by its t-statistic of just 0.7 and a p-value of almost 0.5, so its expected price cannot be distinguished from that of the reference property type. It is possible that, in this case, the other prop_type_simplified categories account for some of the variability in price across property types. The other two insignificant variables, number_of_reviews and review_scores_rating, are surprising outcomes. One would expect good reviews, and the number of them, to have a material impact on the price a host can charge for a stay. An important note is that number_of_reviews only narrowly misses significance: its p-value of 0.069 is just 1.9 percentage points above our 5% threshold. It is also interesting that its coefficient is marginally negative, implying that a single extra review slightly decreases the amount we would expect to pay for a four-night stay. review_scores_rating has a positive effect on the expected price; however, as stated above, it is far from significant, with the lowest absolute t-value and the highest p-value of all terms in the model.
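The t-statistics and p-values in the coefficient table follow directly from the estimates and standard errors. As a minimal sketch, the snippet below recomputes them for number_of_reviews using values copied from the Model 1 tables. The variable names are ours, and the small difference in the recomputed t-statistic comes from the rounding of the printed estimate and standard error.

```r
# Values copied from the Model 1 output above
estimate    <- -0.0004236
std_error   <-  0.0002328
df_residual <- 1639          # df.residual from glance(model1)

# t-statistic: estimate divided by its standard error
# (about -1.82; the table shows -1.8198721, computed from unrounded inputs)
t_from_rounded <- estimate / std_error

# Two-sided p-value from the t distribution, using the tabulated t-statistic
t_stat  <- -1.8198721
p_value <- 2 * pt(abs(t_stat), df = df_residual, lower.tail = FALSE)
p_value  # about 0.069, above the 5% threshold
```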
The significant variables and their coefficients fall in line with expectation. The intercept, or base case, is highly significant with a coefficient of 8.34 (c. 4,187 SEK; remember that we use a logarithmic scale). This is expected, as it lies just above our mean and median values, so the other variables move the expectation away from this central value towards either a premium or a discount. According to this model, renting a private room brings down the expected log price by c. 0.7. This means that choosing a private room reduces the expected 4-night price by about 50%, a considerable change! This is in line with expectation and earlier analysis, as you would expect to pay less for a single room than for an entire property. Conversely, the model expects that staying in an entire residential home increases the expected cost of the trip by around 56% compared to the base case, again in line with expectation, for the opposite reason. Both of these prop_type_simplified categories are highly significant, with negligible p-values and large absolute t-statistics. The property type category Other is also highly significant and has a negative coefficient of c. -0.25, implying that guests who stay in a non-conventional property pay approximately 22% less. This is again somewhat expected given what makes up the Other category, e.g. boats and campervans: these properties might not have the same expenses as conventional properties, and they are niche and likely not centrally located.
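Because the dependent variable is in logs, a coefficient \(\beta\) on a dummy variable corresponds to a change of \((e^{\beta}-1)\cdot 100\) percent in the expected price. The sketch below applies this conversion to the three significant property-type coefficients copied from the Model 1 table; pct_change is a helper name we introduce for illustration.

```r
# Convert a log-scale coefficient into a percentage change in price
pct_change <- function(beta) (exp(beta) - 1) * 100

# Coefficients copied from the Model 1 table above
coefs <- c(
  private_room_in_rental_unit = -0.7059753,
  entire_residential_home     =  0.4424801,
  other                       = -0.2535255
)

effects <- round(pct_change(coefs), 1)
effects  # roughly -50.6, 55.7 and -22.4 percent
```

With a fitted model at hand, the same helper could be applied to coef(model1) directly.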
We can also see that our first model is not very satisfying in explaining the differences in log_price_4_nights because it has an adjusted \(R^2\) of only approximately 0.2 and includes insignificant explanatory variables.
Even though our first model is not very reliable, we still want to run diagnostic plots as they will give us an idea of whether there exist factors we need to account for in order to better explain the price. We do this with the help of the autoplot function.
#plot residuals
autoplot(model1) +
theme_bw()
From the residuals vs. fitted plot we can see that the residuals are not randomly distributed, which indicates a pattern that our variables do not currently account for. The scale-location plot shows that the variance of the residuals is not constant. In addition, the normal Q-Q plot clearly shows that our residuals do not follow a normal distribution. This model therefore does not satisfy the LINE assumptions (linearity, independence, normality, equal variance), and we should add more variables to the model.
Further, we want to check whether there is collinearity between the explanatory variables in our first model. We use vif() to do so.
#calculate VIF of the model
car::vif(model1) %>%
kable(caption = "VIF of model 1",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| prop_type_simplified | 1.017626 | 4 | 1.002187 |
| number_of_reviews | 1.017965 | 1 | 1.008943 |
| review_scores_rating | 1.005087 | 1 | 1.002540 |
The three explanatory variables at hand are not strongly correlated with each other, as all VIF scores are well below 5. There is thus no need to be concerned about collinearity at the moment.
As a next step, we think that the room_type could be another significant predictor for the price of an AirBnB listing in Stockholm. People should be willing to pay more for an entire apartment compared to a shared room. To verify this, we include room_type into our second model.
#create second regression model with all variables from model 1 + room_type
model2 <- lm(log_price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = listings_condensed)
#display result of model
model2 %>%
tidy() %>%
kable(caption = "Model 2 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.3596869 | 0.0956037 | 87.4410389 | 0.0000000 |
| prop_type_simplifiedPrivate room in rental unit | -0.1066975 | 0.0838812 | -1.2720069 | 0.2035514 |
| prop_type_simplifiedEntire residential home | 0.4429607 | 0.0568005 | 7.7985341 | 0.0000000 |
| prop_type_simplifiedEntire condominium (condo) | 0.0447556 | 0.0605159 | 0.7395685 | 0.4596680 |
| prop_type_simplifiedOther | 0.0647332 | 0.0480706 | 1.3466272 | 0.1782869 |
| number_of_reviews | -0.0004092 | 0.0002245 | -1.8228139 | 0.0685140 |
| review_scores_rating | 0.0031937 | 0.0201154 | 0.1587703 | 0.8738694 |
| room_typeHotel room | -0.4782957 | 0.1158764 | -4.1276379 | 0.0000385 |
| room_typePrivate room | -0.5993717 | 0.0735902 | -8.1447214 | 0.0000000 |
| room_typeShared room | -1.2085399 | 0.1312722 | -9.2063687 | 0.0000000 |
model2 %>%
glance() %>%
kable(caption = "Model 2 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2603483 | 0.2562793 | 0.5227113 | 63.98353 | 0 | 9 | -1262.755 | 2547.509 | 2606.976 | 446.9995 | 1636 | 1646 |
We can see that adjusted \(R^2\) increases by about 0.06 (from 0.200 to 0.256) and that the new categorical variable room_type is significant. Its three categories are all statistically significantly different from zero, with large absolute t-values and very small p-values. Unsurprisingly, their introduction has rendered the property type category Private room in rental unit insignificant. This is almost certainly a result of the Private room category in room_type and so is not alarming; it is encouraging to see that these variables share the explanatory power. The room_type coefficients become increasingly negative from Hotel room to Private room to Shared room, in line with expectation. One would expect a considerable discount for a shared room, as guests must forgo privacy and space when choosing one of these listings. A Hotel room also comes at a discount to the base case, which is plausible because guests rent a single room rather than an entire property; at premium hotels this discount will likely be mitigated by the amenities on offer.
Among the terms carried over from the initial model, only Entire residential home remains significant. Private room in rental unit and Other have lost their significance, while Entire condominium, number_of_reviews and review_scores_rating remain insignificant. As number_of_reviews and review_scores_rating are still insignificant, we may want to replace them with other explanatory variables in a later model.
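Adjusted \(R^2\) penalises \(R^2\) for the number of estimated parameters, which is why it is the more suitable yardstick when models differ in size. As a sketch, the function below (adj_r2 is our own name) reproduces the adjusted values of the first two models from the r.squared, nobs and df.residual columns of their glance() tables.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / df.residual
adj_r2 <- function(r2, n, df_residual) {
  1 - (1 - r2) * (n - 1) / df_residual
}

# Inputs copied from the glance() tables of model 1 and model 2
adj_model1 <- adj_r2(0.2026959, n = 1646, df_residual = 1639)
adj_model2 <- adj_r2(0.2603483, n = 1646, df_residual = 1636)

round(c(model1 = adj_model1, model2 = adj_model2,
        gain = adj_model2 - adj_model1), 4)
```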
Again, we want to look at the second model’s residuals and VIF.
#plot residuals
autoplot(model2) +
theme_bw()
#calculate VIF of model
car::vif(model2) %>%
kable(caption = "VIF of model 2",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| prop_type_simplified | 5.442112 | 4 | 1.235865 |
| number_of_reviews | 1.018901 | 1 | 1.009406 |
| review_scores_rating | 1.006072 | 1 | 1.003032 |
| room_type | 5.388214 | 3 | 1.324059 |
Compared to our first model, the residuals vs. fitted plot and the normal Q-Q plot have not really changed. The residuals are still not random, so our second model does not capture all the variables that contribute to the price formation of Stockholm's AirBnB listings either. Unfortunately, we also observe a high level of multi-collinearity between prop_type_simplified and room_type, whose GVIF values (5.44 and 5.39) both exceed 5. As outlined above, the two variables are unsurprisingly related: the Private room category in prop_type_simplified corresponds almost directly to the Private room category in room_type, so a high degree of linear dependence between them is expected. Knowing this, we must remove one of the two variables from future models. Considering that room_type has higher t-values, as shown above, and offers a more intuitive explanation than prop_type_simplified, we decide to discard the latter in our further modelling.
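For categorical predictors with several levels, car::vif() reports a generalised VIF (GVIF) together with GVIF^(1/(2*Df)), which adjusts for the number of coefficients the variable contributes. As a small sketch, the last column of the table above can be reproduced from the first two; adj_gvif is a helper name of our own.

```r
# Adjust a generalised VIF for the degrees of freedom of the term
adj_gvif <- function(gvif, df) gvif^(1 / (2 * df))

# GVIF and Df values copied from the VIF table of model 2
prop_type_adj <- adj_gvif(5.442112, df = 4)
room_type_adj <- adj_gvif(5.388214, df = 3)

round(c(prop_type_simplified = prop_type_adj, room_type = room_type_adj), 6)
```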
Next, we want to include variables that represent the capacity and resources of the property: how many people it fits, how many rooms and beds it has, and how many amenities it offers. Therefore, we add the variables accommodates, bathrooms, bedrooms, beds and number_of_amenities to our model.
Considering that number_of_reviews and review_scores_rating produced insignificant coefficients in our first two models, we replace them with the two binary variables we created to capture differences in AirBnB reviews: reviews_30_plus, which indicates whether a listing has received 10 reviews or more, and top_reviewed, which shows whether a listing has a rating of at least 4.5. In addition, we add reviews_per_month, as it is a better indicator of the frequency of reviews and might influence price, since frequently reviewed listings might gain more attention among customers.
#create model based on significant variables from model 2 and new variables
model3 <- lm(log_price_4_nights ~
room_type +
reviews_30_plus +
top_reviewed +
reviews_per_month +
accommodates +
bathrooms +
bedrooms +
beds +
number_of_amenities,
data = listings_condensed)
#display result of model
model3 %>%
tidy() %>%
kable(caption = "Model 3 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.0334230 | 0.0432102 | 185.9150124 | 0.0000000 |
| room_typeHotel room | -0.1704017 | 0.1585747 | -1.0745828 | 0.2827405 |
| room_typePrivate room | -0.5711384 | 0.0341736 | -16.7128766 | 0.0000000 |
| room_typeShared room | -1.0320178 | 0.1209844 | -8.5301726 | 0.0000000 |
| reviews_30_plusyes | 0.0076541 | 0.0285106 | 0.2684668 | 0.7883782 |
| top_reviewedyes | 0.0669149 | 0.0255267 | 2.6213691 | 0.0088494 |
| reviews_per_month | -0.0175973 | 0.0074774 | -2.3533933 | 0.0187359 |
| accommodates | 0.0597042 | 0.0133886 | 4.4593258 | 0.0000089 |
| bathrooms | -0.0296115 | 0.0225873 | -1.3109771 | 0.1900732 |
| bedrooms | 0.1958223 | 0.0205449 | 9.5314166 | 0.0000000 |
| beds | -0.0134767 | 0.0129317 | -1.0421485 | 0.2975167 |
| number_of_amenities | -0.0038705 | 0.0010930 | -3.5412076 | 0.0004109 |
model3 %>%
glance() %>%
kable(caption = "Model 3 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.437077 | 0.4328036 | 0.4677272 | 102.2786 | 0 | 11 | -956.8743 | 1939.749 | 2008.478 | 316.9959 | 1449 | 1461 |
Our adjusted \(R^2\) in the third model has risen above 0.4, which means that this model explains considerably more of the variability in price than the former two models. Nevertheless, some of the variables we added are not significant: reviews_30_plus, bathrooms and beds. reviews_30_plus is far from significant, with a p-value of c. 0.79. This outcome is somewhat unsurprising considering the insignificance of number_of_reviews in the previous models. The insignificance of bathrooms and beds was initially surprising, but becomes less so when considering their correlation with accommodates and bedrooms. beds is highly correlated with bedrooms, so its predictive power is already captured by the bedrooms variable. bathrooms is not necessarily highly correlated with accommodates and bedrooms, as bathrooms can be shared, but it likely changes in line with these variables, so its predictive power is likewise already represented by them.
An important note is that the room_type category Hotel room is now insignificant, which could be due to the introduction of the number_of_amenities variable. Another thing to note is that the intercept for this model is lower than in previous models: at 8.03 (c. 3,080 SEK), it now sits roughly 700 SEK below our median cost of 3,800 SEK for a 4-night stay. It is likely to have changed to accommodate the different variables we have added.
top_reviewed is statistically significant at the 1% level, with a p-value of c. 0.009. It has a positive coefficient, in line with expectation, as one would expect highly rated properties to command a premium over other listings. reviews_per_month is statistically significant at the 5% level, with a p-value of c. 0.019. Somewhat surprisingly, it has a negative coefficient, implying that an extra review per month reduces the expected price of a listing. Although initially surprising, this is intuitive: high-turnover properties with more frequent reviews are likely to be at the lower end of the price spectrum, and a larger number of reviews gives customers a clearer picture of the property, which can decrease the pricing power of hosts. accommodates and bedrooms both have positive and highly significant coefficients. This is in line with expectation, as one would expect to pay a higher price for a listing with more capacity for guests and a greater number of bedrooms. The number_of_amenities variable is also statistically significant but, surprisingly, has a slightly negative coefficient: for each added amenity, this model expects a listing to have a slightly lower price. This could be because a host who goes to the effort of listing every amenity, including minor ones like shampoo or a kettle, may be trying to make a less premium property sound more appealing. Such attempts to upsell the property in the description are likely limited to cheaper properties, so a negative coefficient makes intuitive sense.
With the higher adjusted \(R^2\) value, we are curious to see whether the residuals of our model are now more random than for the previous two models. We also check again for collinearity with the vif function.
#display residuals
autoplot(model3) +
theme_bw()
#calculate VIF of model 3
car::vif(model3) %>%
kable(caption = "VIF of model 3",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| room_type | 1.394386 | 3 | 1.056973 |
| reviews_30_plus | 1.166882 | 1 | 1.080223 |
| top_reviewed | 1.056313 | 1 | 1.027771 |
| reviews_per_month | 1.172415 | 1 | 1.082781 |
| accommodates | 4.130287 | 1 | 2.032311 |
| bathrooms | 1.441641 | 1 | 1.200684 |
| bedrooms | 2.930998 | 1 | 1.712016 |
| beds | 3.180802 | 1 | 1.783480 |
| number_of_amenities | 1.168305 | 1 | 1.080882 |
We are glad to see that the residuals are clearly more randomly distributed than in the previous two models, except for one outlier at the right end. The VIF output also shows that we do not have any significant multi-collinearity at the moment. We had expected to detect some collinearity between accommodates, bedrooms and beds, as we saw that they were relatively highly correlated in our exploratory data analysis. Although the VIF is below 5 for all three variables, beds is also an insignificant variable, so we will discard it in our next model in any case.
The next thing we want to incorporate into our model is city_area, a variable that gives us information about each listing's geographic location in Stockholm. We also drop reviews_30_plus, bathrooms and beds, which were insignificant in model 3, as well as reviews_per_month, even though it was significant at the 5% level.
#create model with new variable city_area, discarding reviews_30_plus, reviews_per_month, bathrooms and beds
model4 <- lm(log_price_4_nights ~
room_type +
top_reviewed +
accommodates +
bedrooms +
number_of_amenities +
city_area,
data = listings_condensed)
#display results of model
model4 %>%
tidy() %>%
kable(caption = "Model 4 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.0329026 | 0.0354442 | 226.635057 | 0.0000000 |
| room_typeHotel room | -0.3454839 | 0.1316136 | -2.624986 | 0.0087558 |
| room_typePrivate room | -0.4777714 | 0.0312466 | -15.290362 | 0.0000000 |
| room_typeShared room | -1.0425863 | 0.1032754 | -10.095205 | 0.0000000 |
| top_reviewedyes | 0.0505581 | 0.0233745 | 2.162959 | 0.0307063 |
| accommodates | 0.0379789 | 0.0100785 | 3.768310 | 0.0001709 |
| bedrooms | 0.2453846 | 0.0181343 | 13.531491 | 0.0000000 |
| number_of_amenities | -0.0018227 | 0.0009952 | -1.831422 | 0.0672415 |
| city_areaNorth | -0.5592624 | 0.0482544 | -11.589870 | 0.0000000 |
| city_areaSouth | -0.3583494 | 0.0305813 | -11.717907 | 0.0000000 |
| city_areaVerySouth | -0.2836931 | 0.0431686 | -6.571747 | 0.0000000 |
| city_areaWest | -0.4114315 | 0.0473853 | -8.682677 | 0.0000000 |
model4 %>%
glance() %>%
kable(caption = "Model 4 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5238408 | 0.5202509 | 0.4309305 | 145.9183 | 0 | 11 | -842.9338 | 1711.868 | 1780.686 | 270.9379 | 1459 | 1471 |
Our adjusted \(R^2\) increases further, now to more than 50%, which supports our expectation that city_area has an impact on the prices of AirBnB listings in Stockholm. All categories of city_area are highly statistically significant, so location plays an important role in predicting prices for listings across the city. The coefficients on the various areas show the expected change in price for each area compared to the Central base category. So if a listing is in the North, we expect it to be around 43% (\((e^{-0.5592624}-1)\cdot 100 \approx -42.8\)) cheaper than a comparable central listing. In line with our understanding and intuition, our model expects lower prices in all areas of the city relative to the central area, as in most cities. The less negative coefficient on Very South compared with the other areas suggests that it is a more expensive suburb than the others.
Meanwhile, the number_of_amenities variable has become insignificant in the new model (p-value of c. 0.067), in line with what we inferred in the EDA section. Let us again check the distribution of residuals as well as the VIF scores for the variables included in our fourth model.
#display residuals
autoplot(model4) +
theme_bw()
#calculate VIF for model 4
car::vif(model4) %>%
kable(caption = "VIF of model 4",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| room_type | 1.192688 | 3 | 1.029804 |
| top_reviewed | 1.050989 | 1 | 1.025177 |
| accommodates | 2.768178 | 1 | 1.663784 |
| bedrooms | 2.696477 | 1 | 1.642095 |
| number_of_amenities | 1.154029 | 1 | 1.074257 |
| city_area | 1.135969 | 4 | 1.016063 |
Looking at the residuals vs. fitted plot, our residuals now seem to be randomly distributed. The normal q-q plot, however, still indicates that the residuals are not perfectly normally distributed. The VIF scores are all well below 5. Therefore, our fourth model appears to be free from undesirable conditions like collinearity or heteroscedasticity.
Given that we live in an era in which we cannot plan our vacations too far in advance, we also want to see whether a property's (short-term) availability makes it more appealing to guests and thus drives up prices. instant_bookable and availability_30 are the next two variables we add to our model. Since number_of_amenities showed little significance in the previous model, we also exclude it this time.
#create model with availability information and exclude number_of_amenities
model5 <- lm(log_price_4_nights ~
room_type +
top_reviewed +
accommodates +
bedrooms +
city_area +
instant_bookable +
availability_30,
data = listings_condensed)
#display results of model
model5 %>%
tidy() %>%
kable(caption = "Model 5 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 7.9103135 | 0.0322295 | 245.436711 | 0.0000000 |
| room_typeHotel room | -0.5473035 | 0.1289124 | -4.245545 | 0.0000232 |
| room_typePrivate room | -0.5054525 | 0.0303171 | -16.672201 | 0.0000000 |
| room_typeShared room | -1.1666135 | 0.1002818 | -11.633352 | 0.0000000 |
| top_reviewedyes | 0.0748138 | 0.0226041 | 3.309739 | 0.0009566 |
| accommodates | 0.0280673 | 0.0095704 | 2.932719 | 0.0034125 |
| bedrooms | 0.2632345 | 0.0176167 | 14.942350 | 0.0000000 |
| city_areaNorth | -0.5928187 | 0.0465691 | -12.729884 | 0.0000000 |
| city_areaSouth | -0.3713011 | 0.0292096 | -12.711611 | 0.0000000 |
| city_areaVerySouth | -0.3132927 | 0.0415982 | -7.531392 | 0.0000000 |
| city_areaWest | -0.4149656 | 0.0453085 | -9.158667 | 0.0000000 |
| instant_bookableTRUE | -0.0472744 | 0.0259238 | -1.823588 | 0.0684190 |
| availability_30 | 0.0120122 | 0.0011028 | 10.892150 | 0.0000000 |
model5 %>%
glance() %>%
kable(caption = "Model 5 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5597145 | 0.5560907 | 0.4145217 | 154.4573 | 0 | 12 | -785.3229 | 1598.646 | 1672.758 | 250.5255 | 1458 | 1471 |
While the adjusted \(R^2\) has climbed further to about 55.6%, one of our new variables, instant_bookable, is not significant: its p-value of 6.84% is above our significance level of 5%. The variable availability_30, however, is significant, as its p-value is far below 5%. instant_bookable may be insignificant because it is a feature that any host can turn on. That could create two different groups of instantly bookable AirBnBs: very cheap listings whose hosts choose not to vet incoming guests, and very high-end listings whose prices make vetting unnecessary, as the high price deters the guests who would otherwise have been rejected. The significance of availability_30 is surprising, as its correlation with price is almost negligible (as calculated earlier). The effect is still small, though: a one-day increase in availability increases the price by \((e^{0.012012} - 1) \cdot 100 = 1.21\%\). The fact that the price is higher when a listing is more available in the next 30 days could be because listings that are still available at short notice are normally those that are priced higher, as booking far in advance normally comes with a discount (the same principle as booking a flight or hotel room 6 months in advance instead of 2 days).
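Since the effect compounds on the log scale, the same formula can be applied to larger differences in availability. As an illustrative calculation (the 10-day difference is our own choice, and the coefficient is taken from the Model 5 table), a listing with 10 more available days in the next 30 days would be expected to cost

\[ (e^{10 \cdot 0.0120122} - 1) \cdot 100 \approx 12.8\% \]

more, holding the other variables in the model constant.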
Given that we have added new variables to our model, we again check the residual plots and calculate the VIF scores for model5, to make sure that the model assumptions still hold and that no collinearity has been introduced.
#display residuals
autoplot(model5) +
theme_bw()
#calculate VIF for model 5
car::vif(model5) %>%
kable(caption = "VIF of model 5",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| room_type | 1.273913 | 3 | 1.041174 |
| top_reviewed | 1.062204 | 1 | 1.030633 |
| accommodates | 2.697631 | 1 | 1.642447 |
| bedrooms | 2.750180 | 1 | 1.658367 |
| city_area | 1.126735 | 4 | 1.015027 |
| instant_bookable | 1.053377 | 1 | 1.026342 |
| availability_30 | 1.112215 | 1 | 1.054616 |
Again, the model has favorable diagnostic plots and VIFs, meaning that the selected variables seem to explain price differences for AirBnB listings in Stockholm quite well.
The last variables we want to add to the model are host characteristics: host_is_superhost, the interesting characteristic we identified in our exploratory data analysis earlier, and host_response_time. Our analysis above showed that superhosts are likely to charge a premium on their listing prices. As an improvement on model 5, we also discard the variable instant_bookable, which was not significant.
#create model 6 with host_is_superhost and host_response_time as additional variables and remove instant_bookable
model6 <- lm(log_price_4_nights ~
room_type +
top_reviewed +
accommodates +
bedrooms +
city_area +
availability_30 +
host_is_superhost +
host_response_time,
data = listings_condensed)
#display results of model
model6 %>%
tidy() %>%
kable(caption = "Model 6 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 7.9163397 | 0.0587247 | 134.8042010 | 0.0000000 |
| room_typeHotel room | -0.5096782 | 0.1273024 | -4.0036794 | 0.0000667 |
| room_typePrivate room | -0.4877944 | 0.0348112 | -14.0125742 | 0.0000000 |
| room_typeShared room | -1.4870535 | 0.1164534 | -12.7695135 | 0.0000000 |
| top_reviewedyes | 0.0793885 | 0.0263008 | 3.0184853 | 0.0026014 |
| accommodates | 0.0264284 | 0.0106356 | 2.4849055 | 0.0131126 |
| bedrooms | 0.2639284 | 0.0203294 | 12.9826137 | 0.0000000 |
| city_areaNorth | -0.5512042 | 0.0535009 | -10.3027133 | 0.0000000 |
| city_areaSouth | -0.3562631 | 0.0337968 | -10.5413393 | 0.0000000 |
| city_areaVerySouth | -0.3449322 | 0.0510914 | -6.7512835 | 0.0000000 |
| city_areaWest | -0.4070585 | 0.0490838 | -8.2931357 | 0.0000000 |
| availability_30 | 0.0104001 | 0.0013073 | 7.9551791 | 0.0000000 |
| host_is_superhostTRUE | 0.0199346 | 0.0302811 | 0.6583166 | 0.5104782 |
| host_response_timewithin a day | 0.0353750 | 0.0553374 | 0.6392610 | 0.5227918 |
| host_response_timewithin a few hours | -0.0347103 | 0.0571152 | -0.6077240 | 0.5435012 |
| host_response_timewithin an hour | -0.0356862 | 0.0520065 | -0.6861862 | 0.4927464 |
model6 %>%
glance() %>%
kable(caption = "Model 6 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5670535 | 0.5609037 | 0.4061188 | 92.2067 | 0 | 15 | -547.0524 | 1128.105 | 1212.719 | 174.1687 | 1056 | 1072 |
The regression result goes against our intuition, as we could not improve our model with the above steps. The variable host_is_superhost may not be significant because being a superhost does not have anything to do with the prices of the properties being listed. According to AirBnB’s policies, becoming a superhost depends on completing a certain number of trips and maintaining a 4.8 overall rating. These standards do not affect the prices of the properties being listed, so it is understandable that being a superhost does not directly correlate with a property’s price, which makes host_is_superhost an insignificant variable. The levels of host_response_time are likewise not significant.
If we discard host_is_superhost from our sixth model, we have our final model that should have explanatory variables that are all significant and an adjusted \(R^2\) higher than what we have in model 6. Before we do that however, we will check the results of the model that includes all explanatory variables we thought to be worth looking at and compare the results to our most recent model.
#compare recent model to model that includes all variables
model_all <- lm(log_price_4_nights ~ .-price_4_nights -price, data = listings_condensed)
#display results of model with all variables
model_all %>%
tidy() %>%
kable(caption = "Model_all coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 7.9009876 | 0.1462332 | 54.0300710 | 0.0000000 |
| prop_type_simplifiedPrivate room in rental unit | -0.1082636 | 0.0799531 | -1.3540887 | 0.1760031 |
| prop_type_simplifiedEntire residential home | 0.0585938 | 0.0649980 | 0.9014711 | 0.3675472 |
| prop_type_simplifiedEntire condominium (condo) | 0.0894142 | 0.0549080 | 1.6284371 | 0.1037360 |
| prop_type_simplifiedOther | 0.0722663 | 0.0504422 | 1.4326566 | 0.1522574 |
| accommodates | 0.0517085 | 0.0132175 | 3.9121256 | 0.0000974 |
| room_typeHotel room | -0.4913995 | 0.1479473 | -3.3214490 | 0.0009268 |
| room_typePrivate room | -0.3967333 | 0.0710746 | -5.5819281 | 0.0000000 |
| room_typeShared room | -1.4245197 | 0.1365761 | -10.4302258 | 0.0000000 |
| bathrooms | -0.0330947 | 0.0222334 | -1.4885126 | 0.1369198 |
| bedrooms | 0.2661769 | 0.0217539 | 12.2358399 | 0.0000000 |
| beds | -0.0327525 | 0.0125793 | -2.6036786 | 0.0093547 |
| number_of_amenities | -0.0000373 | 0.0011482 | -0.0324719 | 0.9741020 |
| availability_30 | 0.0105420 | 0.0012927 | 8.1551527 | 0.0000000 |
| number_of_reviews | -0.0002507 | 0.0002724 | -0.9200893 | 0.3577401 |
| reviews_per_month | -0.0333734 | 0.0075038 | -4.4475226 | 0.0000096 |
| review_scores_rating | 0.0167637 | 0.0307501 | 0.5451589 | 0.5857615 |
| reviews_30_plusyes | -0.0136396 | 0.0351049 | -0.3885370 | 0.6976984 |
| top_reviewedyes | 0.0503593 | 0.0298970 | 1.6844241 | 0.0924007 |
| instant_bookableTRUE | 0.0102876 | 0.0321627 | 0.3198612 | 0.7491380 |
| city_areaNorth | -0.6467286 | 0.0582072 | -11.1108026 | 0.0000000 |
| city_areaSouth | -0.3996639 | 0.0351874 | -11.3581530 | 0.0000000 |
| city_areaVerySouth | -0.4061064 | 0.0521828 | -7.7823857 | 0.0000000 |
| city_areaWest | -0.4439933 | 0.0502996 | -8.8269842 | 0.0000000 |
| host_is_superhostTRUE | 0.0451205 | 0.0315390 | 1.4306272 | 0.1528383 |
| host_response_timewithin a day | 0.0479117 | 0.0561034 | 0.8539899 | 0.3933078 |
| host_response_timewithin a few hours | -0.0084918 | 0.0585454 | -0.1450469 | 0.8847020 |
| host_response_timewithin an hour | -0.0021575 | 0.0556909 | -0.0387410 | 0.9691044 |
model_all %>%
glance() %>%
kable(caption = "Model_all fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5861151 | 0.575339 | 0.3991654 | 54.38992 | 0 | 27 | -518.9081 | 1095.816 | 1239.967 | 165.2283 | 1037 | 1065 |
#calculate VIF for model_all
car::vif(model_all) %>%
kable(caption = "VIF of model_all",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
fixed_thead = T) %>%
row_spec(0, bold = T) %>%
scroll_box(height = "250px")
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| prop_type_simplified | 9.832667 | 4 | 1.330712 |
| accommodates | 4.328225 | 1 | 2.080439 |
| room_type | 7.730494 | 3 | 1.406159 |
| bathrooms | 1.658046 | 1 | 1.287651 |
| bedrooms | 3.260323 | 1 | 1.805636 |
| beds | 3.284437 | 1 | 1.812302 |
| number_of_amenities | 1.286185 | 1 | 1.134101 |
| availability_30 | 1.193215 | 1 | 1.092344 |
| number_of_reviews | 1.912959 | 1 | 1.383097 |
| reviews_per_month | 1.335626 | 1 | 1.155693 |
| review_scores_rating | 1.438250 | 1 | 1.199270 |
| reviews_30_plus | 1.812085 | 1 | 1.346137 |
| top_reviewed | 1.467439 | 1 | 1.211379 |
| instant_bookable | 1.318067 | 1 | 1.148071 |
| city_area | 1.595913 | 4 | 1.060172 |
| host_is_superhost | 1.270479 | 1 | 1.127155 |
| host_response_time | 1.672694 | 3 | 1.089522 |
We can see from the output above that the \(R^2\) value is not much higher for the model that includes all variables, while the GVIF scores for prop_type_simplified and room_type are well above 5. We have accounted for this in our prior models already. One variable that suddenly stands out but was insignificant in prior models is reviews_per_month, which gives the number of reviews the listing gets per month. Given that our model 6 (less the host_is_superhost variable) controls for other variables, let us see whether there is now a significant effect for reviews_per_month and whether it improves our final model 7.
#create model 7, which is model 6 less host_is_superhost and host_response_time, including reviews_per_month
model7 <- lm(log_price_4_nights ~
room_type +
top_reviewed +
accommodates +
bedrooms +
city_area +
availability_30 +
reviews_per_month,
data = listings_condensed)
#display results of model
model7 %>%
tidy() %>%
kable(caption = "Model 7 coefficients",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 7.9637975 | 0.0329923 | 241.383469 | 0.0000000 |
| room_typeHotel room | -0.5546377 | 0.1268577 | -4.372124 | 0.0000132 |
| room_typePrivate room | -0.5081277 | 0.0298614 | -17.016220 | 0.0000000 |
| room_typeShared room | -1.2041019 | 0.0990205 | -12.160121 | 0.0000000 |
| top_reviewedyes | 0.0668962 | 0.0223505 | 2.993056 | 0.0028084 |
| accommodates | 0.0348981 | 0.0095304 | 3.661785 | 0.0002594 |
| bedrooms | 0.2488663 | 0.0175726 | 14.162168 | 0.0000000 |
| city_areaNorth | -0.6262008 | 0.0460162 | -13.608277 | 0.0000000 |
| city_areaSouth | -0.3974887 | 0.0291337 | -13.643606 | 0.0000000 |
| city_areaVerySouth | -0.3377680 | 0.0412109 | -8.196083 | 0.0000000 |
| city_areaWest | -0.4359752 | 0.0449029 | -9.709290 | 0.0000000 |
| availability_30 | 0.0119586 | 0.0010902 | 10.969075 | 0.0000000 |
| reviews_per_month | -0.0381801 | 0.0062461 | -6.112643 | 0.0000000 |
model7 %>%
glance() %>%
kable(caption = "Model 7 fit",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5697367 | 0.5661954 | 0.4097766 | 160.8852 | 0 | 12 | -768.3873 | 1564.775 | 1638.886 | 244.8228 | 1458 | 1471 |
From the output above, we can see that adding the reviews_per_month variable, which is now significant, improves our adjusted \(R^2\) further and brings it to 56.62%. However, a higher number of reviews per month decreases the price, which might seem counterintuitive. One possible explanation is that more reviews also mean that more somewhat negative experiences or remarks are mentioned, so that people looking at the apartment or house have more information about the accommodation. Hosts might then be somewhat forced to slightly decrease the price, as they can no longer pretend that their offer is 100% perfect. Due to its significance, we will keep reviews_per_month as an explanatory variable, but check for multicollinearity and the distribution of residuals first.
#display residuals
autoplot(model7) +
theme_bw()
#calculate VIF for model 7
car::vif(model7) %>%
kable(caption = "VIF of model 7",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T)
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| room_type | 1.253192 | 3 | 1.038332 |
| top_reviewed | 1.062689 | 1 | 1.030868 |
| accommodates | 2.737423 | 1 | 1.654516 |
| bedrooms | 2.800181 | 1 | 1.673374 |
| city_area | 1.145674 | 4 | 1.017144 |
| availability_30 | 1.112222 | 1 | 1.054619 |
| reviews_per_month | 1.080958 | 1 | 1.039691 |
Even though our residuals are not perfectly normally distributed, we are very pleased with this model compared to the other models. The residuals appear random in the residuals vs. fitted plot, and the VIF scores are all well below 5, meaning that we avoid multicollinearity.
To summarize what we have done with the AirBnB dataset, we will use huxreg to display all models we have produced and see their significance, adjusted \(R^2\) and the Residual Standard Error.
#use huxreg to produce the summary table
huxreg(model1, model2, model3, model4, model5, model6, model7, model_all,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
number_format = "%.3f",
bold_signif = 0.05) #%>%
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 8.340 *** | 8.360 *** | 8.033 *** | 8.033 *** | 7.910 *** | 7.916 *** | 7.964 *** | 7.901 *** |
| (0.099) | (0.096) | (0.043) | (0.035) | (0.032) | (0.059) | (0.033) | (0.146) | |
| prop_type_simplifiedPrivate room in rental unit | -0.706 *** | -0.107 | -0.108 | |||||
| (0.042) | (0.084) | (0.080) | ||||||
| prop_type_simplifiedEntire residential home | 0.442 *** | 0.443 *** | 0.059 | |||||
| (0.059) | (0.057) | (0.065) | ||||||
| prop_type_simplifiedEntire condominium (condo) | 0.044 | 0.045 | 0.089 | |||||
| (0.063) | (0.061) | (0.055) | ||||||
| prop_type_simplifiedOther | -0.254 *** | 0.065 | 0.072 | |||||
| (0.038) | (0.048) | (0.050) | ||||||
| number_of_reviews | -0.000 | -0.000 | -0.000 | |||||
| (0.000) | (0.000) | (0.000) | ||||||
| review_scores_rating | 0.008 | 0.003 | 0.017 | |||||
| (0.021) | (0.020) | (0.031) | ||||||
| room_typeHotel room | -0.478 *** | -0.170 | -0.345 ** | -0.547 *** | -0.510 *** | -0.555 *** | -0.491 *** | |
| (0.116) | (0.159) | (0.132) | (0.129) | (0.127) | (0.127) | (0.148) | ||
| room_typePrivate room | -0.599 *** | -0.571 *** | -0.478 *** | -0.505 *** | -0.488 *** | -0.508 *** | -0.397 *** | |
| (0.074) | (0.034) | (0.031) | (0.030) | (0.035) | (0.030) | (0.071) | ||
| room_typeShared room | -1.209 *** | -1.032 *** | -1.043 *** | -1.167 *** | -1.487 *** | -1.204 *** | -1.425 *** | |
| (0.131) | (0.121) | (0.103) | (0.100) | (0.116) | (0.099) | (0.137) | ||
| reviews_30_plusyes | 0.008 | -0.014 | ||||||
| (0.029) | (0.035) | |||||||
| top_reviewedyes | 0.067 ** | 0.051 * | 0.075 *** | 0.079 ** | 0.067 ** | 0.050 | ||
| (0.026) | (0.023) | (0.023) | (0.026) | (0.022) | (0.030) | |||
| reviews_per_month | -0.018 * | -0.038 *** | -0.033 *** | |||||
| (0.007) | (0.006) | (0.008) | ||||||
| accommodates | 0.060 *** | 0.038 *** | 0.028 ** | 0.026 * | 0.035 *** | 0.052 *** | ||
| (0.013) | (0.010) | (0.010) | (0.011) | (0.010) | (0.013) | |||
| bathrooms | -0.030 | -0.033 | ||||||
| (0.023) | (0.022) | |||||||
| bedrooms | 0.196 *** | 0.245 *** | 0.263 *** | 0.264 *** | 0.249 *** | 0.266 *** | ||
| (0.021) | (0.018) | (0.018) | (0.020) | (0.018) | (0.022) | |||
| beds | -0.013 | -0.033 ** | ||||||
| (0.013) | (0.013) | |||||||
| number_of_amenities | -0.004 *** | -0.002 | -0.000 | |||||
| (0.001) | (0.001) | (0.001) | ||||||
| city_areaNorth | -0.559 *** | -0.593 *** | -0.551 *** | -0.626 *** | -0.647 *** | |||
| (0.048) | (0.047) | (0.054) | (0.046) | (0.058) | ||||
| city_areaSouth | -0.358 *** | -0.371 *** | -0.356 *** | -0.397 *** | -0.400 *** | |||
| (0.031) | (0.029) | (0.034) | (0.029) | (0.035) | ||||
| city_areaVerySouth | -0.284 *** | -0.313 *** | -0.345 *** | -0.338 *** | -0.406 *** | |||
| (0.043) | (0.042) | (0.051) | (0.041) | (0.052) | ||||
| city_areaWest | -0.411 *** | -0.415 *** | -0.407 *** | -0.436 *** | -0.444 *** | |||
| (0.047) | (0.045) | (0.049) | (0.045) | (0.050) | ||||
| instant_bookableTRUE | -0.047 | 0.010 | ||||||
| (0.026) | (0.032) | |||||||
| availability_30 | 0.012 *** | 0.010 *** | 0.012 *** | 0.011 *** | ||||
| (0.001) | (0.001) | (0.001) | (0.001) | |||||
| host_is_superhostTRUE | 0.020 | 0.045 | ||||||
| (0.030) | (0.032) | |||||||
| host_response_timewithin a day | 0.035 | 0.048 | ||||||
| (0.055) | (0.056) | |||||||
| host_response_timewithin a few hours | -0.035 | -0.008 | ||||||
| (0.057) | (0.059) | |||||||
| host_response_timewithin an hour | -0.036 | -0.002 | ||||||
| (0.052) | (0.056) | |||||||
| #observations | 1646 | 1646 | 1461 | 1471 | 1471 | 1072 | 1471 | 1065 |
| R squared | 0.203 | 0.260 | 0.437 | 0.524 | 0.560 | 0.567 | 0.570 | 0.586 |
| Adj. R Squared | 0.200 | 0.256 | 0.433 | 0.520 | 0.556 | 0.561 | 0.566 | 0.575 |
| Residual SE | 0.542 | 0.523 | 0.468 | 0.431 | 0.415 | 0.406 | 0.410 | 0.399 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||||
#kable(caption = "Comparison of models",
# align = "l",
# digits = 1) %>%
#kable_classic(c("striped", "hover", "condensed"),
# html_font = "Arial",
# fixed_thead = T) %>%
#row_spec(0, bold = T) %>%
#scroll_box(width = "100%", height = "400px")
We can see from the above table that, if we exclude the model with all variables we regarded as interesting, model 7 has the highest adjusted \(R^2\) and the second-lowest Residual Standard Error. Since model 7 gives us the best result in terms of coefficient significance and differs only very slightly from the \(R^2\) and RSE metrics of the model with all variables, we believe it is the best model for explanatory and predictive purposes.
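The statistics shown at the bottom of the comparison table can also be collected directly from a list of fitted models. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for the AirBnB listings, and the model names and formulas are invented for the example.

```r
# Toy comparison on the built-in mtcars data (a stand-in, not the AirBnB listings)
models <- list(
  small  = lm(log(mpg) ~ wt, data = mtcars),
  medium = lm(log(mpg) ~ wt + hp, data = mtcars),
  large  = lm(log(mpg) ~ wt + hp + qsec + drat, data = mtcars)
)

# One row per model: observations used, adjusted R squared, residual standard error
comparison <- data.frame(
  model  = names(models),
  nobs   = sapply(models, nobs),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  rse    = sapply(models, sigma)
)

comparison
```

Keeping nobs next to the fit statistics is useful because lm() silently drops rows with missing values, so models with different predictors may be fitted on different subsets of the data, as the #observations row of the table above shows.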
Our final model is thus as follows:
\[
\begin{aligned}
\widehat{log\_price\_4\_nights} = & 7.964 - 0.555 \cdot room\_type_{Hotelroom} - 0.508 \cdot room\_type_{Privateroom}\\
& - 1.204 \cdot room\_type_{Sharedroom} + 0.067 \cdot top\_reviewed_{yes}\\
& + 0.035 \cdot accommodates + 0.249 \cdot bedrooms\\
& - 0.626 \cdot city\_area_{North} - 0.397 \cdot city\_area_{South}\\
& - 0.338 \cdot city\_area_{VerySouth} - 0.436 \cdot city\_area_{West}\\
& + 0.012 \cdot availability\_30 - 0.038 \cdot reviews\_per\_month
\end{aligned}
\]
Note that for the three categorical variables in our model, namely room_type, top_reviewed, and city_area, the baseline values are entire home/apt, no, and Central respectively. This means that our model takes as its baseline an entire home/apartment that does not have a rating of 4.8 or more and is located in central Stockholm.
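To make the formula concrete, the following sketch evaluates it by hand for one illustrative listing. The coefficient values are the Model 7 estimates reported in the coefficient table above; the listing's characteristics and the object names are assumptions chosen for illustration.

```r
# Coefficient estimates copied from the "Model 7 coefficients" table
coefs <- c(
  intercept         = 7.9637975,
  room_private      = -0.5081277,
  top_reviewed_yes  = 0.0668962,
  accommodates      = 0.0348981,
  bedrooms          = 0.2488663,
  availability_30   = 0.0119586,
  reviews_per_month = -0.0381801
)

# Illustrative listing (assumed values): private room in Central (the baseline
# area), rating of 4.8 or more, 2 guests, 1 bedroom, 26 available days in the
# next 30 days and 2.64 reviews per month
log_price <- coefs[["intercept"]] +
  coefs[["room_private"]] +
  coefs[["top_reviewed_yes"]] +
  coefs[["accommodates"]] * 2 +
  coefs[["bedrooms"]] * 1 +
  coefs[["availability_30"]] * 26 +
  coefs[["reviews_per_month"]] * 2.64

# Back-transform the predicted log price into a price for 4 nights
price <- exp(log_price)
round(price)  # about 3138 SEK
```

Baseline categories contribute nothing to the sum, which is why no city_area term appears for a listing in Central.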
To check for overfitting and validate the above model, we will now conduct out-of-sample testing. We split the dataset into two subsets and fit model 7 on the training subset to obtain new coefficient estimates. We then apply this retrained model 7 to the testing subset and see how much the Root Mean Squared Error changes.
#set the random seed for reproducibility
set.seed(1234)
#split into two subsets, one for training and the other for testing
train_test_split <- initial_split(listings_condensed, prop = 0.75)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)
#train model7 with data from training dataset
model7trained <- lm(log_price_4_nights ~
room_type +
top_reviewed +
accommodates +
bedrooms +
city_area +
availability_30 +
reviews_per_month,
data = listings_train)
msummary(model7trained)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.986161 0.039304 203.188 < 2e-16 ***
room_typeHotel room -0.738746 0.150345 -4.914 1.03e-06 ***
room_typePrivate room -0.530160 0.034785 -15.241 < 2e-16 ***
room_typeShared room -1.235301 0.106765 -11.570 < 2e-16 ***
top_reviewedyes 0.072907 0.026181 2.785 0.00545 **
accommodates 0.024938 0.010909 2.286 0.02245 *
bedrooms 0.254313 0.020520 12.393 < 2e-16 ***
city_areaNorth -0.616268 0.052660 -11.703 < 2e-16 ***
city_areaSouth -0.394287 0.034248 -11.513 < 2e-16 ***
city_areaVerySouth -0.321489 0.047898 -6.712 3.09e-11 ***
city_areaWest -0.437844 0.052296 -8.372 < 2e-16 ***
availability_30 0.012014 0.001282 9.375 < 2e-16 ***
reviews_per_month -0.035598 0.007525 -4.731 2.53e-06 ***
Residual standard error: 0.4148 on 1083 degrees of freedom
(341 observations deleted due to missingness)
Multiple R-squared: 0.5715, Adjusted R-squared: 0.5667
F-statistic: 120.4 on 12 and 1083 DF, p-value: < 2.2e-16
#calculate the RMSE of the training data fitted with model 7 with log price
rmse_train <- listings_train %>%
mutate(predictions = predict(model7trained,.)) %>%
select(predictions,log_price_4_nights) %>%
#filter out those rows where predicted value is NA
filter(!is.na(predictions)) %>%
mutate(squared_error = (predictions - log_price_4_nights)^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
pull()
rmse_train
[1] 0.4123676
#calculate the RMSE of the testing data fitted with model 7 with log price
rmse_test <- listings_test %>%
mutate(predictions = predict(model7trained,.)) %>%
select(predictions,log_price_4_nights) %>%
#filter out those rows where predicted value is NA
filter(!is.na(predictions)) %>%
mutate(squared_error = (predictions - log_price_4_nights)^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
pull()
rmse_test
[1] 0.3972884
#calculate the RMSE of the training data fitted with model 7 with normal price
rmse_train_nolog <- listings_train %>%
mutate(predictions = exp(predict(model7trained,.))) %>%
select(predictions, price_4_nights) %>%
#filter out those rows where predicted value is NA
filter(!is.na(predictions)) %>%
mutate(squared_error = (predictions - price_4_nights)^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
pull()
rmse_train_nolog
[1] 2919.272
#calculate the RMSE of the testing data fitted with model 7 with normal price
rmse_test_nolog <- listings_test %>%
mutate(predictions = exp(predict(model7trained,.))) %>%
select(predictions, price_4_nights) %>%
#filter out those rows where predicted value is NA
filter(!is.na(predictions)) %>%
mutate(squared_error = (predictions - price_4_nights)^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
pull()
rmse_test_nolog
[1] 2927.192
#calculate the difference of RMSE between training and testing data with log price
print(rmse_test - rmse_train)
[1] -0.01507916
#calculate the difference of RMSE between training and testing data with normal price
print(rmse_test_nolog - rmse_train_nolog)
[1] 7.919808
From the above output we can see that our model achieves a similar \(R^2\) to the original model when trained on the training dataset, which contains 75% of the observations of the dataset we used to build our model 7. The RMSE of the test set is even slightly smaller than that of the training set. While a lower test error is unusual and may simply reflect the random split, the difference is negligible at only 0.015. Calculating the RMSE with the normal prices (i.e. taking the exponential of the predicted log prices and comparing them to price_4_nights) again results in very similar RMSEs; this time the RMSE of the test set is slightly higher, by 7.92 SEK, which is very small relative to the mean price. Therefore, we conclude that the RMSEs of our training and test sets are very similar and that the out-of-sample performance of our model is strong, meaning that we have not overfitted the data.
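The four RMSE calculations above all follow the same steps: predict, drop missing predictions, square the errors, average, and take the square root. A minimal sketch of a reusable helper is shown below; the function name rmse and the toy numbers are our own assumptions and not part of the original analysis.

```r
# Hypothetical helper: root mean squared error, ignoring pairs with missing values
rmse <- function(actual, predicted) {
  keep <- !is.na(actual) & !is.na(predicted)
  sqrt(mean((actual[keep] - predicted[keep])^2))
}

# Toy example with made-up log prices, including one missing prediction
actual    <- c(7.8, 7.3, 7.6, 8.2)
predicted <- c(8.0, 7.3, NA, 7.8)

toy_rmse <- rmse(actual, predicted)
toy_rmse
```

With the objects from the analysis, a call such as rmse(listings_test$log_price_4_nights, predict(model7trained, listings_test)) should reproduce the test RMSE on the log scale.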
Finally, we want to use our best model to predict the price of a 4-night stay in Stockholm. Suppose we are planning to visit the city over reading week and want to stay in an AirBnB. We first find AirBnBs in Stockholm that are apartments with a private room, have at least 10 reviews, and have an average rating of at least 90% (i.e. 4.5 or above). We then use our best model to predict the total cost of staying at such an AirBnB for 4 nights.
Let’s create a new dataset that meets these requirements from listings_condensed, using the four conditions above as filters. Note that for the rating condition the code uses the top_reviewed flag, which requires a rating of 4.8 or more and is thus stricter than the 4.5 threshold.
#create new dataset based on outlined conditions
target_listings <- listings_condensed %>%
filter(prop_type_simplified == "Private room in rental unit",
room_type == "Private room",
number_of_reviews >= 10,
top_reviewed == "yes")
head(target_listings) %>%
kable(caption = "First 6 rows of our target dataset",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| prop_type_simplified | accommodates | room_type | bathrooms | bedrooms | beds | number_of_amenities | availability_30 | number_of_reviews | reviews_per_month | review_scores_rating | reviews_30_plus | top_reviewed | instant_bookable | city_area | host_is_superhost | price | host_response_time | price_4_nights | log_price_4_nights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Private room in rental unit | 2 | Private room | 1.0 | 1 | 2 | 23 | 26 | 319 | 2.64 | 4.85 | yes | yes | TRUE | Central | TRUE | 643 | within an hour | 2572 | 7.852439 |
| Private room in rental unit | 2 | Private room | 1.0 | 1 | 0 | 27 | 29 | 104 | 0.95 | 4.85 | yes | yes | TRUE | North | FALSE | 360 | NA | 1440 | 7.272398 |
| Private room in rental unit | 2 | Private room | 1.0 | 1 | 1 | 27 | 0 | 593 | 5.34 | 4.91 | yes | yes | TRUE | Central | TRUE | 490 | within an hour | 1960 | 7.580700 |
| Private room in rental unit | 2 | Private room | 1.5 | 1 | 1 | 33 | 18 | 88 | 1.29 | 4.92 | yes | yes | FALSE | Central | TRUE | 695 | within a few hours | 2780 | 7.930206 |
| Private room in rental unit | 2 | Private room | 1.0 | 1 | 1 | 28 | 8 | 145 | 1.88 | 4.95 | yes | yes | FALSE | Central | TRUE | 943 | within an hour | 3772 | 8.235361 |
| Private room in rental unit | 2 | Private room | 1.0 | 1 | 2 | 19 | 12 | 92 | 1.05 | 4.82 | yes | yes | FALSE | VerySouth | TRUE | 450 | within an hour | 1800 | 7.495542 |
There are 73 listings that suit our preferences. We can now use predict() to generate price estimates with our model. Alternatively, we can use broom::augment(). Here we use the latter, as it makes adding confidence intervals more convenient.
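For comparison, the base R route via predict() looks as follows. This is a minimal sketch on simulated data, so every number and object name in it is an assumption; it also contrasts the confidence interval (for the average price of listings with given characteristics) with the wider prediction interval (for a single listing).

```r
set.seed(42)

# Simulated data standing in for the listings (all values are made up)
n <- 200
toy <- data.frame(bedrooms = sample(1:4, n, replace = TRUE))
toy$log_price <- 7.5 + 0.25 * toy$bedrooms + rnorm(n, sd = 0.4)

toy_model   <- lm(log_price ~ bedrooms, data = toy)
new_listing <- data.frame(bedrooms = 2)

# Interval for the average log price of all listings with these characteristics
conf_int <- predict(toy_model, newdata = new_listing, interval = "confidence")

# Interval for the log price of one individual listing (wider)
pred_int <- predict(toy_model, newdata = new_listing, interval = "prediction")

# Back-transform from log price to price
exp(conf_int)
exp(pred_int)
```

Both calls return a matrix with the columns fit, lwr and upr, which play the same role as the .fitted, .lower and .upper columns used with augment() in this report.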
#predict log_price_4_nights
target_listings <- augment(model7,
newdata = target_listings,
se_fit = T,
interval = "confidence")
#convert log price to price
target_listings <- target_listings %>%
mutate(expected_price = exp(.fitted),
expected_price_lower = exp(.lower),
expected_price_upper = exp(.upper))
#show expected prices and confidence intervals of all target listings
target_listings %>%
select(expected_price, expected_price_lower, expected_price_upper) %>%
kable(caption = "Our target prices and confidence intervals",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial",
fixed_thead = T) %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%", height = "400px")
| expected_price | expected_price_lower | expected_price_upper |
|---|---|---|
| 3138.051 | 2922.790 | 3369.166 |
| 1854.846 | 1673.466 | 2055.884 |
| 2074.233 | 1920.089 | 2240.752 |
| 3002.593 | 2822.965 | 3193.650 |
| 2604.820 | 2457.333 | 2761.160 |
| 2011.981 | 1842.344 | 2197.236 |
| 2407.239 | 2266.578 | 2556.629 |
| 1632.220 | 1518.916 | 1753.976 |
| 2550.760 | 2401.061 | 2709.793 |
| 1589.905 | 1476.778 | 1711.699 |
| 1650.138 | 1541.170 | 1766.810 |
| 3212.558 | 2993.115 | 3448.090 |
| 2463.054 | 2317.195 | 2618.094 |
| 2516.201 | 2338.960 | 2706.873 |
| 3283.915 | 3063.804 | 3519.839 |
| 2916.810 | 2736.468 | 3109.037 |
| 2380.764 | 2240.515 | 2529.794 |
| 2469.645 | 2323.145 | 2625.385 |
| 2293.330 | 2154.190 | 2441.458 |
| 1819.141 | 1663.403 | 1989.460 |
| 2507.651 | 2356.908 | 2668.036 |
| 6221.866 | 5694.347 | 6798.253 |
| 1633.215 | 1525.130 | 1748.959 |
| 2083.625 | 1907.513 | 2275.996 |
| 1662.787 | 1553.000 | 1780.334 |
| 1676.172 | 1565.372 | 1794.815 |
| 2558.843 | 2413.888 | 2712.503 |
| 2418.843 | 2233.140 | 2619.989 |
| 2506.694 | 2356.069 | 2666.949 |
| 1119.612 | 869.266 | 1442.057 |
| 2425.724 | 2282.965 | 2577.410 |
| 1691.602 | 1579.445 | 1811.724 |
| 1676.643 | 1518.586 | 1851.150 |
| 2459.295 | 2313.790 | 2613.951 |
| 2409.110 | 2267.435 | 2559.638 |
| 1602.203 | 1455.694 | 1763.459 |
| 1895.781 | 1764.829 | 2036.451 |
| 1642.595 | 1534.051 | 1758.819 |
| 1508.324 | 1396.155 | 1629.505 |
| 1415.868 | 1292.310 | 1551.240 |
| 1402.980 | 1269.699 | 1550.250 |
| 2478.146 | 2330.776 | 2634.834 |
| 2499.820 | 2329.700 | 2682.364 |
| 2347.369 | 2208.119 | 2495.400 |
| 2380.072 | 2200.137 | 2574.722 |
| 2321.521 | 2182.551 | 2469.340 |
| 1584.278 | 1476.864 | 1699.505 |
| 2244.296 | 2086.609 | 2413.898 |
| 1613.998 | 1506.624 | 1729.024 |
| 2410.030 | 2268.299 | 2560.617 |
| 2511.484 | 2360.263 | 2672.394 |
| 3176.711 | 2978.797 | 3387.775 |
| 4366.831 | 4019.189 | 4744.542 |
| 2830.519 | 2666.650 | 3004.458 |
| 1638.211 | 1529.891 | 1754.200 |
| 1302.793 | 1180.695 | 1437.519 |
| 1263.120 | 1143.679 | 1395.036 |
| 2599.908 | 2347.841 | 2879.038 |
| 2294.870 | 2062.603 | 2553.292 |
| 1783.405 | 1628.204 | 1953.399 |
| 1648.503 | 1534.893 | 1770.523 |
| 1655.186 | 1545.908 | 1772.188 |
| 2536.471 | 2392.200 | 2689.442 |
| 2871.494 | 2705.751 | 3047.388 |
| 1384.483 | 1256.977 | 1524.922 |
| 2459.295 | 2313.790 | 2613.951 |
| 2390.784 | 2250.091 | 2540.275 |
| 1603.065 | 1454.824 | 1766.412 |
| 2500.004 | 2350.187 | 2659.370 |
| 2434.073 | 2290.701 | 2586.420 |
| 1644.855 | 1490.854 | 1814.764 |
| 1289.449 | 1160.107 | 1433.212 |
| 2159.076 | 2013.149 | 2315.582 |
#show summary statistics including CI
favstats(target_listings$expected_price) %>%
mutate(SE = sd/sqrt(n),
t_crit = qt(0.975, n-1),
margin_of_error = t_crit * SE,
lower = mean - margin_of_error,
upper = mean + margin_of_error) %>%
kable(caption = "Statistics on our target prices",
align = "l") %>%
kable_classic(c("striped", "hover", "condensed"),
html_font = "Arial") %>%
row_spec(0, bold = T) %>%
scroll_box(width = "100%")
| min | Q1 | median | Q3 | max | mean | sd | n | missing | SE | t_crit | margin_of_error | lower | upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1119.612 | 1648.503 | 2321.521 | 2506.694 | 6221.866 | 2227.613 | 754.2488 | 73 | 0 | 88.27814 | 1.993464 | 175.9793 | 2051.634 | 2403.592 |
Our predictions for the price of a 4-night stay have a mean of 2,228 SEK (188 GBP), which is our point estimate, and range from 1,120 to 6,222 SEK. The 95% confidence interval for the mean predicted price ranges from 2,051.63 SEK to 2,403.59 SEK, meaning that we can be 95% confident that the average price of a 4-night stay (price_4_nights) for listings of this kind lies between these two numbers.
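The interval quoted above can be reproduced from the summary statistics alone. The sketch below copies the mean, standard deviation and sample size from the statistics table; the object names are our own.

```r
# Summary statistics copied from the "Statistics on our target prices" table
mean_price <- 2227.613
sd_price   <- 754.2488
n          <- 73

# Standard error of the mean and critical t value for a 95% interval
se     <- sd_price / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
margin <- t_crit * se

# Confidence interval for the mean predicted price
ci <- c(lower = mean_price - margin, upper = mean_price + margin)
round(ci, 2)
```

This interval describes the uncertainty about the mean of the 73 predicted prices, not the spread of individual listing prices, which is considerably wider.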
We also create a line and ribbon plot to visualize our general prediction on selected listings.
#arrange and create an ID in ascending order
target_listings <- target_listings %>%
arrange(expected_price) %>%
mutate(ID = row_number())
#plot distribution of predicted prices
ggplot(target_listings, aes(x = ID)) +
geom_line(aes(y = expected_price)) +
geom_ribbon(aes(ymin = expected_price_lower,
ymax = expected_price_upper),
fill = "grey",
alpha = 0.5) +
geom_hline(yintercept = mean(target_listings$expected_price),
color = "blue")+
labs(title = "Price predictions for four-night stay in Stockholm",
x = NULL,
y = "Price") +
theme_bw() +
NULL
As is usually the case with statistics, we conclude that less is more. In our final model, five categories of variables explain the (log of the) price of an AirBnB in Stockholm, namely: the room type, its reviews, how many people it can accommodate, its availability, and its location. Our model explains roughly 57% of the variance of the log price, a strong relationship. A thorough statistical analysis led us to insights we would not have expected, such as the fact that a higher number of reviews per month might impact the price negatively, or that having a superhost is not important for estimating the price. Conversely, we verified some of our intuitions, such as that the more people a flat can accommodate, the more expensive it will be. Our process led us to a model that a couple wanting to stay in Stockholm for four nights could use to estimate, with reasonable accuracy, the price they would have to pay for their stay, as shown in our out-of-sample testing.
However, as the late statistician George Box famously said, all models are wrong but some are useful. Although we are confident that our final model can do a good job of estimating AirBnB prices in Stockholm, there is certainly room for improvement. We know that location is always going to be a very significant factor when looking for a place to stay. For instance, we believe that it would be valuable to delve deeper into each of the 14 original neighbourhoods and see, by longitude and latitude, where the closest metro stations and other public transport stops are. This could be expanded even further if we had data about users’ habits when on holiday, for example whether they like to visit museums or enjoy fancy restaurants. The location of the AirBnB will be crucial with regard to these points of interest.
In addition, travelling is seasonal, especially in Stockholm, which is known for its dark and cold winters. People might not want to travel to a Nordic country in the winter, which decreases prices, but love to travel there in the summer when the sun almost never sets. Therefore, there is also room for future work when it comes to accounting for seasonality in predicting the price of an AirBnB in Stockholm. This could be done with time-series data for the same apartments to see how demand, and in turn the price of listings, changes.
Another idea we had but could not realize due to time constraints would be the use of natural language processing (NLP) on some of the character strings we were given. For example, we could count the number of positive words mentioned in the description of each listing, or simply count the number of words therein, and see whether this leads to a difference in the price of the listing. Another factor that is hard to account for but is likely to have an influence is the quality of the pictures. Humans are visual creatures, and appealing, high-quality pictures are likely to lead them to pay higher prices, or at least lead hosts to offer their apartments at higher prices. This could be another thing to test for when predicting or explaining the price differences for AirBnBs in Stockholm.
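The word-count idea could be prototyped in base R along the following lines. This is only a sketch under stated assumptions: the example descriptions and the short list of positive words are invented, and a real analysis would more likely rely on an established sentiment lexicon.

```r
# Made-up listing descriptions (the real data has a description column)
descriptions <- c("Bright and cozy flat near the metro", "Small room", NA)

# Hypothetical mini-lexicon of positive words, for illustration only
positive_words <- c("bright", "cozy", "lovely", "spacious")

# Lower-case the text and split it on anything that is not a letter
tokens <- strsplit(tolower(descriptions), "[^a-z]+")

# Number of words per description (NA when the description is missing)
word_count <- sapply(tokens, function(w) {
  if (all(is.na(w))) NA_integer_ else sum(nzchar(w))
})

# Number of positive words per description
positive_count <- sapply(tokens, function(w) {
  if (all(is.na(w))) NA_integer_ else sum(w %in% positive_words)
})

data.frame(descriptions, word_count, positive_count)
```

The resulting columns could then be added to the regression as additional explanatory variables and tested for significance like the others.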
All in all, we are satisfied with our result; the fact that our analysis has generated further questions reflects how thoroughly we have engaged with the data set. Should a couple decide to travel to Stockholm and want to book an AirBnB, we are confident that we can give them a prediction for the price they will pay based on our model. However, this should only be used as a first indication and rough guideline and not taken as guaranteed.